Removing stop words with Apache Lucene TokenStream causes this error:
TokenStream contract violation: reset()/close() call missing, reset() called multiple times, or subclass does not call super.reset(). Please see Javadocs of TokenStream class for more information about the correct consuming workflow.
I am using this code:
public static String removeStopWords(String string) throws IOException {
    TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_47, new StringReader(string));
    TokenFilter tokenFilter = new StandardFilter(Version.LUCENE_47, tokenStream);
    TokenStream stopFilter = new StopFilter(Version.LUCENE_47, tokenFilter, StandardAnalyzer.STOP_WORDS_SET);
    StringBuilder stringBuilder = new StringBuilder();
    CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
    while (stopFilter.incrementToken()) {
        if (stringBuilder.length() > 0) {
            stringBuilder.append(" ");
        }
        stringBuilder.append(token.toString());
    }
    stopFilter.end();
    stopFilter.close();
    return stringBuilder.toString();
}
But as you can see, I never call reset() or close() more than once.
So why am I getting this error?
Answer 0 (score: 8)
"I never call reset() or close()."
Well, that is exactly your problem. If you read the TokenStream
javadoc, you will find the following:
The workflow of the new TokenStream API is as follows:
1. Instantiation of TokenStream/TokenFilters which add/get attributes to/from the AttributeSource.
2. The consumer calls TokenStream#reset().
3. ...
I just had to add a single reset() call to the code and it worked:
...
CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
tokenStream.reset(); // I added this
while (stopFilter.incrementToken()) {
...
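For reference, here is a minimal sketch of the whole corrected method. The class name StopWordRemover is mine; it assumes Lucene 4.7 on the classpath, and consumes the end of the filter chain consistently:

```java
import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class StopWordRemover {
    public static String removeStopWords(String string) throws IOException {
        // Build the analysis chain: tokenizer -> standard filter -> stop filter.
        TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_47, new StringReader(string));
        tokenStream = new StandardFilter(Version.LUCENE_47, tokenStream);
        tokenStream = new StopFilter(Version.LUCENE_47, tokenStream, StandardAnalyzer.STOP_WORDS_SET);

        StringBuilder stringBuilder = new StringBuilder();
        CharTermAttribute token = tokenStream.addAttribute(CharTermAttribute.class);
        tokenStream.reset();                 // mandatory before the first incrementToken()
        while (tokenStream.incrementToken()) {
            if (stringBuilder.length() > 0) {
                stringBuilder.append(" ");
            }
            stringBuilder.append(token.toString());
        }
        tokenStream.end();                   // finalize offsets
        tokenStream.close();                 // release resources
        return stringBuilder.toString();
    }
}
```

Note that reset() is called on the end of the chain; the filters delegate it down to the tokenizer (as long as every filter calls super.reset(), which the built-in ones do).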
Answer 1 (score: 0)
I hit this error when reusing the same Tokenizer. The reason is right there in the comment of the Lucene source. The solution is to set a new Reader on the tokenizer, or to create a new tokenizer.
/** Expert: Set a new reader on the Tokenizer. Typically, an
 *  analyzer (in its tokenStream method) will use
 *  this to re-use a previously created tokenizer. */
public final void setReader(Reader input) {
    if (input == null) {
        throw new NullPointerException("input must not be null");
    } else if (this.input != ILLEGAL_STATE_READER) {
        throw new IllegalStateException("TokenStream contract violation: close() call missing");
    }
    this.inputPending = input;
    setReaderTestPoint();
}