"TokenStream contract violation: close() call missing" when calling addDocument

Asked: 2016-10-07 15:12:20

Tags: java lucene

I am using Lucene to build a simple way of matching similar words within a text.

My idea is to run an Analyzer over my text to get a TokenStream, and for each token run a FuzzyQuery to see whether it has a match in my index. If it does not, I simply index a new Document containing just the new unique word.

This is what I am getting, however:

Exception in thread "main" java.lang.IllegalStateException: TokenStream contract violation: close() call missing
    at org.apache.lucene.analysis.Tokenizer.setReader(Tokenizer.java:90)
    at org.apache.lucene.analysis.Analyzer$TokenStreamComponents.setReader(Analyzer.java:411)
    at org.apache.lucene.analysis.standard.StandardAnalyzer$1.setReader(StandardAnalyzer.java:111)
    at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:165)
    at org.apache.lucene.document.Field.tokenStream(Field.java:568)
    at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:708)
    at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:417)
    at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:373)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:231)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:478)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1562)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1307)
    at org.myPackage.MyClass.addToIndex(MyClass.java:58)

The relevant code:

// Setup tokenStream based on StandardAnalyzer
TokenStream tokenStream = analyzer.tokenStream(TEXT_FIELD_NAME, new StringReader(input));
tokenStream = new StopFilter(tokenStream, EnglishAnalyzer.getDefaultStopSet());
tokenStream = new ShingleFilter(tokenStream, 3);
tokenStream.addAttribute(CharTermAttribute.class);
tokenStream.reset();
...
// Iterate and process each token from the stream
while (tokenStream.incrementToken()) {
    CharTermAttribute charTerm = tokenStream.getAttribute(CharTermAttribute.class);
    processWord(charTerm.toString());
}
...
// Processing a word means looking for a similar one inside the index and, if not found, adding this one to the index
void processWord(String word) {
    ...
    if (DirectoryReader.indexExists(index)) {
        reader = DirectoryReader.open(index);
        IndexSearcher searcher = new IndexSearcher(reader);
        TopDocs searchResults = searcher.search(query, 1);
        if (searchResults.totalHits > 0) {
            Document foundDocument = searcher.doc(searchResults.scoreDocs[0].doc);
            super.processWord(foundDocument.get(TEXT_FIELD_NAME));
        } else {
            addToIndex(word);
        }
    } else {
        addToIndex(word);
    }
    ...
}
...
// Create a new Document to index the provided word
void addToIndex(String word) throws IOException {
    Document newDocument = new Document();
    newDocument.add(new TextField(TEXT_FIELD_NAME, new StringReader(word)));
    indexWriter.addDocument(newDocument);
    indexWriter.commit();
}

The exception seems to be telling me that the TokenStream should be closed before anything is added to the index, but that makes no sense to me: how are the index and the TokenStream related? I mean, the index just receives a Document containing a String; where that String originally came from should be irrelevant.

Any hints on how to solve this?

1 answer:

Answer 0 (score: 1):

The problem is that you are reusing the same Analyzer that the IndexWriter is trying to use. You open a TokenStream from that analyzer, and then try to index a document. That document needs to be analyzed, but the analyzer finds its previous TokenStream still open and throws the exception.

To fix it, create a new, separate Analyzer for processing and testing strings, instead of borrowing the one the IndexWriter is using.
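A minimal sketch of that idea, assuming a Lucene 6.x-era classpath (the era of this question); the class name, field name, and `indexTokens` helper are hypothetical, not part of the original code. The key point is that the tokenizing loop and the IndexWriter each own their own StandardAnalyzer instance, so `addDocument()` never collides with the stream that is still open:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class SeparateAnalyzerDemo {

    // Tokenize the input with its own analyzer while indexing each token
    // through a writer that owns a *different* analyzer instance.
    // Returns the tokens that were seen and indexed.
    static List<String> indexTokens(String input) throws IOException {
        List<String> seen = new ArrayList<>();
        Analyzer writerAnalyzer = new StandardAnalyzer(); // owned by the IndexWriter
        Analyzer streamAnalyzer = new StandardAnalyzer(); // independent, for our loop
        try (Directory dir = new RAMDirectory();
             IndexWriter writer =
                     new IndexWriter(dir, new IndexWriterConfig(writerAnalyzer));
             // try-with-resources guarantees close() on the stream, honoring the
             // reset() -> incrementToken()* -> end() -> close() contract
             TokenStream ts = streamAnalyzer.tokenStream("text", input)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                seen.add(term.toString());
                Document doc = new Document();
                doc.add(new TextField("text", term.toString(), Field.Store.YES));
                // Safe: addDocument() re-enters writerAnalyzer, not streamAnalyzer.
                writer.addDocument(doc);
            }
            ts.end();
        }
        return seen;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(indexTokens("some sample text"));
    }
}
```

Alternatively, keeping a single analyzer can also work if you finish with the stream first: call `tokenStream.end()` and `tokenStream.close()` before the first `addToIndex(...)` call, since `Tokenizer.setReader` only throws when the previous stream was left open. But the two-analyzer approach lets you keep indexing inside the token loop, as your code does.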