Question

Removing stop words with an Apache Lucene TokenStream fails with this error:

TokenStream contract violation: reset()/close() call missing, reset() called multiple times, or subclass does not call super.reset(). Please see Javadocs of TokenStream class for more information about the correct consuming workflow.

I am using this code:

public static String removeStopWords(String string) throws IOException {
    TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_47, new StringReader(string));
    TokenFilter tokenFilter = new StandardFilter(Version.LUCENE_47, tokenStream);
    TokenStream stopFilter = new StopFilter(Version.LUCENE_47, tokenFilter, StandardAnalyzer.STOP_WORDS_SET);
    StringBuilder stringBuilder = new StringBuilder();

    CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);

    while(stopFilter.incrementToken()) {
        if(stringBuilder.length() > 0 ) {
            stringBuilder.append(" ");
        }

        stringBuilder.append(token.toString());
    }

    stopFilter.end();
    stopFilter.close();

    return stringBuilder.toString();
}

But as you can see, I never call reset() or close().

So why am I getting this error?

Answer 1

I never call reset() or close().

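Well, that is your problem. If you read the TokenStream Javadoc, you will find the following:

The workflow of the new TokenStream API is as follows:

1. Instantiation of TokenStream/TokenFilters which add/get attributes to/from the AttributeSource.

2. The consumer calls TokenStream#reset().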

...

I only had to add a reset() line to the code, and it worked.

...    
CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
tokenStream.reset();   // I added this 
while(stopFilter.incrementToken()) {
...
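
For reference, here is a minimal sketch of the whole method with the full consumer workflow (reset, incrementToken loop, end, close), written against Lucene 4.7 as in the question. The class name StopWordRemover is a placeholder. Calling reset() on the outermost filter instead of on the tokenizer is a choice made here, not something the answer prescribes; a filter's reset() is passed down the chain to the tokenizer.

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

// Placeholder class name. Assumes Lucene 4.7 (lucene-core and
// lucene-analyzers-common) is on the classpath.
public final class StopWordRemover {

    public static String removeStopWords(String text) throws IOException {
        TokenStream tokenizer = new StandardTokenizer(Version.LUCENE_47, new StringReader(text));
        TokenStream standardFilter = new StandardFilter(Version.LUCENE_47, tokenizer);
        StringBuilder result = new StringBuilder();

        // try-with-resources guarantees close() even if consumption fails;
        // closing the outermost filter also closes the streams it wraps.
        try (TokenStream stopFilter =
                 new StopFilter(Version.LUCENE_47, standardFilter, StandardAnalyzer.STOP_WORDS_SET)) {
            // Attributes are shared by every stream in the chain.
            CharTermAttribute term = stopFilter.addAttribute(CharTermAttribute.class);

            stopFilter.reset();                  // must come before the first incrementToken()
            while (stopFilter.incrementToken()) {
                if (result.length() > 0) {
                    result.append(' ');
                }
                result.append(term.toString());
            }
            stopFilter.end();                    // end-of-stream bookkeeping
        }
        return result.toString();
    }
}
```

It is called the same way as the original, for example StopWordRemover.removeStopWords("the quick brown fox is fast").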

Answer 2

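I ran into this error when reusing the same Tokenizer. The reason is right there in the source comment. The solution is to set a new reader or to create a new tokenizer.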

  /** Expert: Set a new reader on the Tokenizer.  Typically, an
   *  analyzer (in its tokenStream method) will use
   *  this to re-use a previously created tokenizer. */
  public final void setReader(Reader input) {
    if (input == null) {
      throw new NullPointerException("input must not be null");
    } else if (this.input != ILLEGAL_STATE_READER) {
      throw new IllegalStateException("TokenStream contract violation: close() call missing");
    }
    this.inputPending = input;
    setReaderTestPoint();
  }
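
To make the guard easier to follow, here is a simplified, Lucene-free model of it. Everything in this sketch (the class ReaderGuardDemo, the NO_READER sentinel) is hypothetical and only mimics the state handling shown above: a new reader is accepted only when no reader is live, which means either on a fresh instance or after close().

```java
import java.io.Reader;
import java.io.StringReader;

// Hypothetical stand-in for a tokenizer; it models only the reader guard,
// not tokenization. This is not Lucene code.
public final class ReaderGuardDemo {

    // Sentinel meaning "no live reader" (plays the role of ILLEGAL_STATE_READER).
    private static final Reader NO_READER = new StringReader("");

    private Reader input = NO_READER;
    private Reader inputPending = NO_READER;

    public void setReader(Reader reader) {
        if (reader == null) {
            throw new NullPointerException("input must not be null");
        } else if (input != NO_READER) {
            // A previous reader is still live: the caller forgot close().
            throw new IllegalStateException("contract violation: close() call missing");
        }
        inputPending = reader;
    }

    // Activates the pending reader, as the consumer would before reading tokens.
    public void reset() {
        input = inputPending;
        inputPending = NO_READER;
    }

    // Releases the live reader so that setReader() is allowed again.
    // (A real tokenizer would also close the underlying reader here.)
    public void close() {
        input = NO_READER;
        inputPending = NO_READER;
    }

    public static void main(String[] args) {
        ReaderGuardDemo tokenizer = new ReaderGuardDemo();
        tokenizer.setReader(new StringReader("first document"));
        tokenizer.reset();

        try {
            tokenizer.setReader(new StringReader("second document"));
        } catch (IllegalStateException e) {
            System.out.println("Rejected: " + e.getMessage());
        }

        tokenizer.close();                                         // finish the first pass
        tokenizer.setReader(new StringReader("second document"));  // now accepted
        tokenizer.reset();
        System.out.println("Reuse after close() succeeded");
    }
}
```

The same ordering applies when reusing a real Tokenizer: finish the previous pass with close() before handing it the next reader.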

Apache Lucene TokenStream contract violation
