I am writing a stemmer using the Porter stem filter.
    public static String stemmer(final String unstemmedText) throws ParseException, IOException {
        TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_4_10_4, new StringReader(unstemmedText));
        tokenStream = new StopFilter(Version.LUCENE_4_10_4, tokenStream, StandardAnalyzer.STOP_WORDS_SET);
        tokenStream = new PorterStemFilter(tokenStream);
        StringBuilder sb = new StringBuilder();
        CharTermAttribute charTermAttr = tokenStream.getAttribute(CharTermAttribute.class);
        while (tokenStream.incrementToken()) {
            if (sb.length() > 0) {
                sb.append(" ");
            }
            sb.append(charTermAttr.toString());
        }
        return sb.toString();
    }
I get the following exception:
java.lang.NullPointerException
at org.apache.lucene.analysis.standard.StandardTokenizerImpl.zzRefill(StandardTokenizerImpl.java:923)
at org.apache.lucene.analysis.standard.StandardTokenizerImpl.getNextToken(StandardTokenizerImpl.java:1133)
So, after some googling, I decided to call reset() on the token stream before consuming it and end() on the stream after processing.
https://thekandyancode.wordpress.com/2013/02/04/tokenizing-stopping-and-stemming-using-apache-lucene/
So my modification is:
    tokenStream.reset();
    while (tokenStream.incrementToken()) {
        if (sb.length() > 0) {
            sb.append(" ");
        }
        sb.append(charTermAttr.toString());
    }
    tokenStream.end();
Now I get this exception:
java.lang.IllegalStateException: TokenStream contract violation: reset()/close() call missing, reset() called multiple times, or subclass does not call super.reset(). Please see Javadocs of TokenStream class for more information about the correct consuming workflow.
So the earlier code needed a reset() call, and now that I make that call it does not work at all. What should I do?
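For reference, the consuming workflow that the exception message points to in the TokenStream Javadocs is: call reset(), call incrementToken() until it returns false, call end(), then call close(). The sketch below is a toy, standard-library-only model of that call-order contract, not Lucene code. The names ToyTokenStream, ContractDemo and joinTokens are hypothetical, and the whitespace splitting is only a stand-in for real tokenizing and stemming.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Hypothetical toy class: models the call-order contract only, not Lucene's implementation.
final class ToyTokenStream {
    private final String text;
    private Iterator<String> tokens; // stays null until reset() is called
    private String current;

    ToyTokenStream(String text) {
        this.text = text;
    }

    // Must be called once before the first incrementToken().
    void reset() {
        if (tokens != null) {
            throw new IllegalStateException("reset() called multiple times without close()");
        }
        String trimmed = text.trim();
        List<String> parts = trimmed.isEmpty()
                ? Collections.<String>emptyList()
                : Arrays.asList(trimmed.split("\\s+"));
        tokens = parts.iterator();
    }

    // Advances to the next token; refuses to run if reset() was skipped.
    boolean incrementToken() {
        if (tokens == null) {
            throw new IllegalStateException("reset() call missing");
        }
        if (!tokens.hasNext()) {
            return false;
        }
        current = tokens.next();
        return true;
    }

    String term() {
        return current;
    }

    // End-of-stream bookkeeping would go here in a real stream.
    void end() {
    }

    // Releases state so the stream could be reset() again.
    void close() {
        tokens = null;
        current = null;
    }
}

final class ContractDemo {
    // Consumes the stream in the order reset -> incrementToken* -> end -> close.
    static String joinTokens(String text) {
        ToyTokenStream ts = new ToyTokenStream(text);
        StringBuilder sb = new StringBuilder();
        try {
            ts.reset();
            while (ts.incrementToken()) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(ts.term());
            }
            ts.end();
        } finally {
            ts.close(); // always release the stream, even if consumption fails
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(joinTokens("quick  brown foxes")); // prints "quick brown foxes"
    }
}
```

In this toy model, skipping reset() or calling it twice without an intervening close() raises an IllegalStateException, which mirrors the kinds of misuse the message above lists. Whether the same sequence resolves the exception in the Lucene code from the question is not established here.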
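I am very disappointed with the way the Lucene project is developed. If the reset contract breaks because it suddenly starts being enforced, developers will think twice before upgrading. In addition, the API is drastically revamped between releases, and handy old classes such as the Porter stemmers are deprecated. This makes Lucene unbelievably difficult to use. Please pull your socks up and architect it better.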