Apache Lucene TokenStream contract violation

Time: 2014-05-29 10:55:12

Tags: java lucene

Removing stop words with an Apache Lucene TokenStream results in this error:

TokenStream contract violation: reset()/close() call missing, reset() called multiple times, or subclass does not call super.reset(). Please see Javadocs of TokenStream class for more information about the correct consuming workflow.

I am using this code:

public static String removeStopWords(String string) throws IOException {
    TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_47, new StringReader(string));
    TokenFilter tokenFilter = new StandardFilter(Version.LUCENE_47, tokenStream);
    TokenStream stopFilter = new StopFilter(Version.LUCENE_47, tokenFilter, StandardAnalyzer.STOP_WORDS_SET);
    StringBuilder stringBuilder = new StringBuilder();

    CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);

    while(stopFilter.incrementToken()) {
        if(stringBuilder.length() > 0 ) {
            stringBuilder.append(" ");
        }

        stringBuilder.append(token.toString());
    }

    stopFilter.end();
    stopFilter.close();

    return stringBuilder.toString();
}

But as you can see, I never call reset() or close().

So why am I getting this error?

2 answers:

Answer 0: (score: 8)

  

"I never call reset() or close()."

Well, that is your problem. If you read the TokenStream javadoc, you will find the following:

  

The workflow of the TokenStream API is as follows:

  1. Instantiation of TokenStream/TokenFilters which add/get attributes to/from the AttributeSource.
  2. The consumer calls TokenStream#reset().
  3. ...
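
To see why a missing reset() produces this kind of message even though the caller never touched reset() or close(), here is a toy model of the workflow in plain Java. It is not Lucene's code: the class ToyTokenStream, its whitespace splitting, and its exception text are invented for illustration, and only the order of calls mirrors the javadoc steps quoted above.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy stand-in for a TokenStream; hypothetical, not part of Lucene. */
class ToyTokenStream {
    private final String text;
    private String[] tokens;   // null until reset() has been called
    private int next;
    private String current;

    ToyTokenStream(String text) {
        this.text = text;
    }

    /** Step 2 of the workflow: prepares the stream for consumption. */
    void reset() {
        String trimmed = text.trim();
        tokens = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        next = 0;
    }

    /** Advances to the next token; fails if reset() was skipped. */
    boolean incrementToken() {
        if (tokens == null) {
            throw new IllegalStateException("contract violation: reset() call missing");
        }
        if (next >= tokens.length) {
            return false;
        }
        current = tokens[next++];
        return true;
    }

    String term() {
        return current;
    }

    /** Drops the state, so the stream must be reset() again before reuse. */
    void close() {
        tokens = null;
    }

    /** Consumes a stream in the documented order: reset, incrementToken loop, close. */
    static List<String> consume(ToyTokenStream stream) {
        List<String> out = new ArrayList<>();
        stream.reset();
        while (stream.incrementToken()) {
            out.add(stream.term());
        }
        stream.close();
        return out;
    }
}
```

In this model, calling incrementToken() on a fresh ToyTokenStream throws immediately, which is the same shape of failure as in the question: the error is raised because a required call is absent, not because one was made incorrectly.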

I only needed to add one line, a call to reset(), to your code, and it worked.

...    
CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
tokenStream.reset();   // I added this 
while(stopFilter.incrementToken()) {
...
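
Putting the fix in context, a complete version of the asker's method might look like the sketch below. It assumes Lucene 4.7 (the version used in the question) with lucene-core and lucene-analyzers-common on the classpath; the wrapper class name StopWordRemover is made up here. Unlike the snippet above, it calls reset() on the outermost filter, which forwards the call down the chain to the tokenizer, and it reads the term attribute from that same filter.

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

// Hypothetical wrapper class; assumes Lucene 4.7 on the classpath.
class StopWordRemover {
    static String removeStopWords(String text) throws IOException {
        TokenStream tokenizer = new StandardTokenizer(Version.LUCENE_47, new StringReader(text));
        TokenStream standardFilter = new StandardFilter(Version.LUCENE_47, tokenizer);
        StringBuilder result = new StringBuilder();

        // try-with-resources guarantees close(), even if consumption throws.
        try (TokenStream stopFilter =
                 new StopFilter(Version.LUCENE_47, standardFilter, StandardAnalyzer.STOP_WORDS_SET)) {
            CharTermAttribute term = stopFilter.addAttribute(CharTermAttribute.class);
            stopFilter.reset();                 // required before the first incrementToken()
            while (stopFilter.incrementToken()) {
                if (result.length() > 0) {
                    result.append(' ');
                }
                result.append(term.toString());
            }
            stopFilter.end();                   // end-of-stream bookkeeping
        }
        return result.toString();
    }
}
```

Wrapping the stream in try-with-resources is a design choice rather than a requirement: it keeps close() from being skipped when incrementToken() throws, which matters if the underlying tokenizer is later reused.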

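Answer 1: (score: 0)

I ran into this error when reusing the same Tokenizer. The reason is given in the comment and the check in setReader(), shown below: a Tokenizer that has not been closed rejects a new reader. The solution is to set a new reader on the tokenizer after closing it, or to create a new tokenizer.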

  /** Expert: Set a new reader on the Tokenizer.  Typically, an
   *  analyzer (in its tokenStream method) will use
   *  this to re-use a previously created tokenizer. */
  public final void setReader(Reader input) {
    if (input == null) {
      throw new NullPointerException("input must not be null");
    } else if (this.input != ILLEGAL_STATE_READER) {
      throw new IllegalStateException("TokenStream contract violation: close() call missing");
    }
    this.inputPending = input;
    setReaderTestPoint();
  }
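
The check in setReader() above implies an order for reuse: finish and close() the tokenizer, hand it the next reader, then reset() before consuming again. Below is a sketch of that order, assuming Lucene 4.7 on the classpath; the class TokenizerReuse and its methods consume and tokenizeBoth are invented names.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

// Hypothetical helper class; assumes Lucene 4.7 on the classpath.
class TokenizerReuse {
    /** Runs one full reset/incrementToken/end/close cycle and returns the terms. */
    static List<String> consume(Tokenizer tokenizer) throws IOException {
        List<String> terms = new ArrayList<>();
        CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
        try {
            tokenizer.reset();
            while (tokenizer.incrementToken()) {
                terms.add(term.toString());
            }
            tokenizer.end();
        } finally {
            tokenizer.close();   // without this, the next setReader() throws
        }
        return terms;
    }

    /** Tokenizes two strings with a single Tokenizer instance. */
    static List<String> tokenizeBoth(String first, String second) throws IOException {
        Tokenizer tokenizer = new StandardTokenizer(Version.LUCENE_47, new StringReader(first));
        List<String> terms = consume(tokenizer);
        tokenizer.setReader(new StringReader(second));   // legal only because consume() closed it
        terms.addAll(consume(tokenizer));
        return terms;
    }
}
```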