I'm using Lucene to build a simple way of matching similar words in a text. My idea is to run an Analyzer over my text to get a TokenStream, and then to run a FuzzyQuery for each token to see whether there is a match in my index. If there isn't, I simply index a new Document containing just the new, unique word.

Here is what I'm getting, though:
Exception in thread "main" java.lang.IllegalStateException: TokenStream contract violation: close() call missing
at org.apache.lucene.analysis.Tokenizer.setReader(Tokenizer.java:90)
at org.apache.lucene.analysis.Analyzer$TokenStreamComponents.setReader(Analyzer.java:411)
at org.apache.lucene.analysis.standard.StandardAnalyzer$1.setReader(StandardAnalyzer.java:111)
at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:165)
at org.apache.lucene.document.Field.tokenStream(Field.java:568)
at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:708)
at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:417)
at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:373)
at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:231)
at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:478)
at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1562)
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1307)
at org.myPackage.MyClass.addToIndex(MyClass.java:58)
The relevant code:
// Setup tokenStream based on StandardAnalyzer
TokenStream tokenStream = analyzer.tokenStream(TEXT_FIELD_NAME, new StringReader(input));
tokenStream = new StopFilter(tokenStream, EnglishAnalyzer.getDefaultStopSet());
tokenStream = new ShingleFilter(tokenStream, 3);
tokenStream.addAttribute(CharTermAttribute.class);
tokenStream.reset();
...
// Iterate and process each token from the stream
while (tokenStream.incrementToken()) {
    CharTermAttribute charTerm = tokenStream.getAttribute(CharTermAttribute.class);
    processWord(charTerm.toString());
}
...
// Processing a word means looking for a similar one inside the index and, if not found, adding this one to the index
void processWord(String word) {
    ...
    if (DirectoryReader.indexExists(index)) {
        reader = DirectoryReader.open(index);
        IndexSearcher searcher = new IndexSearcher(reader);
        TopDocs searchResults = searcher.search(query, 1);
        if (searchResults.totalHits > 0) {
            Document foundDocument = searcher.doc(searchResults.scoreDocs[0].doc);
            super.processWord(foundDocument.get(TEXT_FIELD_NAME));
        } else {
            addToIndex(word);
        }
    } else {
        addToIndex(word);
    }
    ...
}
...
// Create a new Document to index the provided word
void addWordToIndex(String word) throws IOException {
    Document newDocument = new Document();
    newDocument.add(new TextField(TEXT_FIELD_NAME, new StringReader(word)));
    indexWriter.addDocument(newDocument);
    indexWriter.commit();
}
The exception seems to tell me that I should close the TokenStream before adding anything to the index, but that doesn't make sense to me: how are the index and the TokenStream even related? I mean, the index just receives a Document containing a String, and the TokenStream that String originally came from should be irrelevant.

Any hints on how to fix this?
Answer 0 (score: 1):
The problem is that you are reusing the same analyzer that the IndexWriter is trying to use. You open a TokenStream from that analyzer, and then try to index a document. That document needs to be analyzed, but the analyzer finds its old TokenStream still open and throws the exception.
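The "close() call missing" check can be illustrated without Lucene at all. The sketch below uses hypothetical MiniAnalyzer/MiniTokenStream classes (not real Lucene types) that only mimic the contract Lucene enforces in Tokenizer.setReader(): asking the analyzer for a new stream while its previous one is still open throws, and closing the old stream first makes reuse legal again.

```java
// Hypothetical stand-ins that mimic Lucene's TokenStream reuse contract;
// these are NOT real Lucene classes.
class MiniTokenStream implements AutoCloseable {
    private boolean open = true;
    boolean isOpen() { return open; }
    @Override public void close() { open = false; }
}

class MiniAnalyzer {
    private MiniTokenStream current;
    // Like Analyzer.tokenStream(): stream components are reused per analyzer,
    // so a still-open previous stream triggers the contract violation.
    MiniTokenStream tokenStream() {
        if (current != null && current.isOpen()) {
            throw new IllegalStateException(
                "TokenStream contract violation: close() call missing");
        }
        current = new MiniTokenStream();
        return current;
    }
}

public class ContractDemo {
    // Returns true if requesting a second stream fails while the first is
    // still open, and succeeds once the first stream has been closed.
    public static boolean demo() {
        MiniAnalyzer analyzer = new MiniAnalyzer();
        MiniTokenStream first = analyzer.tokenStream();
        boolean threw = false;
        try {
            analyzer.tokenStream(); // analyzer reused while "first" is open
        } catch (IllegalStateException e) {
            threw = true;
        }
        first.close();              // releasing the stream...
        analyzer.tokenStream();     // ...makes reuse legal again
        return threw;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In the question's code the same thing happens: the stream opened for iterating tokens is never closed, so when addToIndex() makes the IndexWriter analyze the new document with the same analyzer, the check fires.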
To fix it, you could create a new, separate analyzer for processing and testing your strings, instead of using the analyzer the IndexWriter is working with.
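A minimal sketch of that fix, assuming Lucene 5.x-style APIs (the version the stack trace suggests): give the tokenizing code its own StandardAnalyzer, separate from the one handed to the IndexWriter, and use try-with-resources so close() is always called when iteration is done. The variables index, input, TEXT_FIELD_NAME, and processWord are taken from the question's code; the rest of the setup is illustrative.

```java
// One analyzer used exclusively for tokenizing the input text...
Analyzer queryAnalyzer = new StandardAnalyzer();
// ...and a different one owned by the IndexWriter (illustrative setup).
IndexWriter indexWriter = new IndexWriter(index,
        new IndexWriterConfig(new StandardAnalyzer()));

// try-with-resources guarantees close(); end() must still be called
// after the last incrementToken(), per the TokenStream contract.
try (TokenStream tokenStream =
         queryAnalyzer.tokenStream(TEXT_FIELD_NAME, new StringReader(input))) {
    CharTermAttribute charTerm = tokenStream.addAttribute(CharTermAttribute.class);
    tokenStream.reset();
    while (tokenStream.incrementToken()) {
        processWord(charTerm.toString());
    }
    tokenStream.end();
}
```

Even with a separate analyzer, closing the stream when you are done with it is required by the TokenStream workflow, so the try-with-resources block is worth keeping either way.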