I need to find the most frequently used terms in a text. After looking around, I created my own Analyzer subclass and overrode its createComponents method.
@Override
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer source = new NGramTokenizer(Version.LUCENE_47, reader, 12, 12);
    TokenStream filter = new LowerCaseFilter(Version.LUCENE_47, source);
    try {
        TokenStream tokenStream = tokenStream(fieldName, reader);
        OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
        tokenStream.reset();
        System.out.println("tokenStream " + tokenStream);
        while (tokenStream.incrementToken()) {
            //int startOffset = offsetAttribute.startOffset();
            //int endOffset = offsetAttribute.endOffset();
            String term = charTermAttribute.toString();
            System.out.println("term = " + term);
        }
    } catch(Exception e) {
        e.printStackTrace();
    }
    return new TokenStreamComponents(source, filter);
}
This is how I call it:
Directory index = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_47, rma);
StringReader sr = new StringReader(descProd1);
IndexWriter w = new IndexWriter(index, config);
LuceneUtil.addDoc(w, descProd1, "193398817");
rma.createComponents("content", sr);
w.close();
rma.close();
The addDoc method:
public static void addDoc(IndexWriter w, String title, String isbn) throws IOException {
    Document doc = new Document();
    doc.add(new TextField("title", title, Field.Store.YES));
    doc.add(new StringField("isbn", isbn, Field.Store.YES));
    w.addDocument(doc);
}
When I run this, it blows up with a java.lang.StackOverflowError on this line:
TokenStream tokenStream = tokenStream(fieldName, reader);
I am new to Lucene, so I am not sure whether I am on the right track. Am I?
Answer 0 (score: 0)
tokenStream calls createComponents, and your createComponents calls tokenStream! So you are stuck in infinite recursion, which is what overflows the stack.
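The call cycle can be reproduced without Lucene. The sketch below is only an illustration: SketchAnalyzer, RecursiveAnalyzer, FixedAnalyzer and RecursionDemo are hypothetical stand-ins that mimic the call pattern (a base-class method that delegates to an overridable createComponents), not Lucene's real classes.

```java
import java.util.Locale;

// Hypothetical stand-in for the base class: tokenStream delegates to createComponents.
abstract class SketchAnalyzer {
    final String tokenStream(String fieldName, String text) {
        return createComponents(fieldName, text);
    }

    protected abstract String createComponents(String fieldName, String text);
}

// Same mistake as in the question: createComponents calls back into tokenStream.
class RecursiveAnalyzer extends SketchAnalyzer {
    @Override
    protected String createComponents(String fieldName, String text) {
        return tokenStream(fieldName, text); // re-enters createComponents and never returns
    }
}

// Fix: createComponents only builds and returns its result.
class FixedAnalyzer extends SketchAnalyzer {
    @Override
    protected String createComponents(String fieldName, String text) {
        return text.toLowerCase(Locale.ROOT);
    }
}

class RecursionDemo {
    // Reports whether calling tokenStream on the given analyzer overflows the stack.
    static boolean overflows(SketchAnalyzer analyzer) {
        try {
            analyzer.tokenStream("content", "Some Text");
            return false;
        } catch (StackOverflowError e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("recursive overflows: " + overflows(new RecursiveAnalyzer()));
        System.out.println("fixed overflows: " + overflows(new FixedAnalyzer()));
    }
}
```

The first analyzer overflows for the same reason as the code in the question; the second does not, because its createComponents never calls tokenStream.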
Why would you read the stream inside createComponents? Just do:
@Override
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer source = new NGramTokenizer(Version.LUCENE_47, reader, 12, 12);
    TokenStream filter = new LowerCaseFilter(Version.LUCENE_47, source);
    return new TokenStreamComponents(source, filter);
}
Then pass your analyzer to the IndexWriterConfig, and everything else will be done behind the scenes when documents are added.
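To get a feel for what that chain feeds into the index, here is a rough plain-Java approximation of a 12-character n-gram tokenizer followed by lowercasing. It is a sketch under assumptions, not Lucene's implementation: NGramSketch and charNGrams are hypothetical names, and offsets, code points and buffering are ignored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class NGramSketch {
    // Returns every lowercased window of exactly `size` characters, in order of position.
    static List<String> charNGrams(String text, int size) {
        List<String> grams = new ArrayList<>();
        String lower = text.toLowerCase(Locale.ROOT);
        for (int start = 0; start + size <= lower.length(); start++) {
            grams.add(lower.substring(start, start + size));
        }
        return grams;
    }

    public static void main(String[] args) {
        // Minimum and maximum gram size are both 12, as in the analyzer above.
        for (String gram : charNGrams("Lucene Analyzer", 12)) {
            System.out.println(gram);
        }
    }
}
```

Note that with both sizes set to 12, input shorter than 12 characters produces no tokens at all in this sketch.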
Answer 1 (score: 0)
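I am the OP and new to Lucene. The code in my question was not on the right track. After more searching, I pieced together some code that finds the highest-frequency terms. Here it is: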
// create an analyzer:
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_47);
// create an index and add the text (strings) you want to analyze:
Directory index = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_47, analyzer);
IndexWriter w = new IndexWriter(index, config);
addDoc(w, text1, "");
addDoc(w, text2, "");
addDoc(w, text3, "");
w.close();
// a comparator is needed for the HighFreqTerms.getHighFreqTerms method:
Comparator<TermStats> comparator = new Comparator<TermStats>() {
    @Override
    public int compare(TermStats o1, TermStats o2) {
        if(o1.totalTermFreq > o2.totalTermFreq) {
            return 1;
        } else if(o2.totalTermFreq > o1.totalTermFreq) {
            return -1;
        }
        return 0;
    }
};
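// open a reader on the index; the snippet as posted never defined reader or fieldName.
// Assumption: addDoc indexes the text under "title", as in the question, so that field is inspected.
// find the highest frequency terms:
try {
    IndexReader reader = DirectoryReader.open(index);
    String fieldName = "title";
    TermStats[] ts = HighFreqTerms.getHighFreqTerms(reader, 50, fieldName, comparator);
    for (int i = 0; i < ts.length; i++) {
        System.out.println(ts[i]);
    }
    reader.close();
} catch(Exception e) {
    e.printStackTrace();
}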
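For comparison, the same idea (count how often each term occurs, then rank by count) can be sketched in plain Java without an index. This is a stand-in for illustration, not how HighFreqTerms works internally: TermFrequencySketch and topTerms are hypothetical names, and splitting on non-word characters is a much cruder tokenizer than StandardAnalyzer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class TermFrequencySketch {
    // Returns up to `limit` terms with their counts, most frequent first.
    static List<Map.Entry<String, Long>> topTerms(List<String> texts, int limit) {
        Map<String, Long> counts = new HashMap<>();
        for (String text : texts) {
            // Crude tokenization: lowercase, then split on runs of non-word characters.
            for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!term.isEmpty()) {
                    counts.merge(term, 1L, Long::sum);
                }
            }
        }
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        // Highest count first; ties are broken alphabetically so the order is deterministic.
        entries.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(entries.subList(0, Math.min(limit, entries.size())));
    }

    public static void main(String[] args) {
        List<String> texts = Arrays.asList("the cat and the dog", "The dog barks");
        for (Map.Entry<String, Long> entry : topTerms(texts, 2)) {
            System.out.println(entry.getKey() + " " + entry.getValue());
        }
    }
}
```

Long.compare also gives a shorter way to write a frequency comparison like the one in the comparator above.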