Exception when removing stop words with Apache Lucene

Time: 2017-08-20 15:30:04

Tags: lucene stop-words

I am using the code below to remove stop words from an input text. At runtime, the call to tokenStream.incrementToken() throws this exception:

java.lang.IllegalStateException: TokenStream contract violation: reset()/close() call missing, reset() called multiple times, or subclass does not call super.reset(). Please see Javadocs of TokenStream class for more information about the correct consuming workflow.

Code:

public static String removeStopWords(String textFile) throws Exception {
    CharArraySet stopWords = EnglishAnalyzer.getDefaultStopSet();
    TokenStream tokenStream = new StandardTokenizer();
    tokenStream = new StopFilter(tokenStream, stopWords);
    StringBuilder sb = new StringBuilder();
    CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
    tokenStream.reset();
    while (tokenStream.incrementToken()) {
        String term = charTermAttribute.toString();
        sb.append(term + " ");
    }
    return sb.toString();
}

1 answer:

Answer 0 (score: 1)

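The code in the question never passes textFile to the tokenizer, so the stream has no input to read. Instantiate the TokenStream as follows: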

TokenStream tokenStream = new StandardAnalyzer().tokenStream("field",new StringReader(textFile));
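
If you would rather keep the explicit StandardTokenizer and StopFilter chain from the question, the missing step appears to be a call to setReader() on the tokenizer before reset(). The following is a minimal sketch of the full consume workflow (reset, incrementToken, end, close), not a definitive implementation. The class name StopWordRemover and the sample sentence are made up for illustration, and the imports assume Lucene 7 or later; in 5.x/6.x, CharArraySet lives in org.apache.lucene.analysis.util and StopFilter in org.apache.lucene.analysis.core.

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical wrapper class, introduced only so the sketch compiles on its own.
public class StopWordRemover {

    public static String removeStopWords(String text) throws IOException {
        CharArraySet stopWords = EnglishAnalyzer.getDefaultStopSet();

        // A Tokenizer created with the no-arg constructor has no input yet;
        // the text must be attached with setReader() before reset() is called.
        StandardTokenizer tokenizer = new StandardTokenizer();
        tokenizer.setReader(new StringReader(text));

        StringBuilder sb = new StringBuilder();
        // try-with-resources closes the filter, which also closes the tokenizer.
        try (TokenStream tokenStream = new StopFilter(tokenizer, stopWords)) {
            CharTermAttribute term = tokenStream.addAttribute(CharTermAttribute.class);
            tokenStream.reset();
            while (tokenStream.incrementToken()) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(term.toString());
            }
            tokenStream.end(); // signals end of stream, as the TokenStream contract requires
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Sample input chosen for illustration only.
        System.out.println(removeStopWords("the quick brown fox jumps over the lazy dog"));
    }
}
```

Note that this chain does no lowercasing, so a capitalized "The" would pass through the filter unchanged; wrapping the tokenizer in a LowerCaseFilter before the StopFilter is one way to handle that.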