I am using HTMLStripCharFilter inside the createComponents method of my custom analyzer, but the HTML is not being stripped from the content. The code is below.
@Override
protected TokenStreamComponents createComponents(String fieldName)
{
    StandardTokenizer source = new StandardTokenizer();
    source.setReader(mStripHTML ? new HTMLStripCharFilter(getReader()) : getReader());
    source.setMaxTokenLength(maxTokenLength);
    TokenStream result = new StandardFilter(source);
    result = new LowerCaseFilter(result);
    return new TokenStreamComponents(source, result);
}
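Answer 0 (score: 1)
Your CharFilter should not be set up in your createComponents method; it belongs in initReader. When a token stream is requested, Analyzer passes the incoming Reader through initReader and then sets the result on the tokenizer, so any reader you set yourself inside createComponents is replaced: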
@Override
protected Reader initReader(String fieldName, Reader reader) {
    // Wrap the raw reader here so the tokenizer only ever sees stripped text.
    return mStripHTML ? new HTMLStripCharFilter(reader) : reader;
}
@Override
protected TokenStreamComponents createComponents(String fieldName)
{
    StandardTokenizer source = new StandardTokenizer();
    // No setReader call here: Analyzer supplies the reader returned by initReader.
    source.setMaxTokenLength(maxTokenLength);
    TokenStream result = new StandardFilter(source);
    result = new LowerCaseFilter(result);
    return new TokenStreamComponents(source, result);
}
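To see why the placement matters, here is a toy model of the call order in plain Java. It is only a sketch: ToyAnalyzer, ToyTokenizer and TagStrippingReader are hypothetical stand-ins written for this illustration, not Lucene classes, and the tag stripper is far cruder than HTMLStripCharFilter. The one thing it tries to mirror is that the framework sets the reader after createComponents has run.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Toy model only: hypothetical classes that mimic the call order, not Lucene's API. */
final class InitReaderSketch {

    /** Crude stand-in for a char filter: drops everything between '<' and '>'. */
    static final class TagStrippingReader extends Reader {
        private final Reader in;
        private boolean insideTag = false;

        TagStrippingReader(Reader in) { this.in = in; }

        @Override
        public int read(char[] buf, int off, int len) throws IOException {
            if (len == 0) return 0;
            int written = 0;
            while (written < len) {
                int c = in.read();
                if (c == -1) break;                       // end of input
                if (c == '<') insideTag = true;           // tag opens: start skipping
                else if (c == '>' && insideTag) insideTag = false; // tag closes
                else if (!insideTag) buf[off + written++] = (char) c;
            }
            return written == 0 ? -1 : written;
        }

        @Override
        public void close() throws IOException { in.close(); }
    }

    /** Stand-in for a tokenizer: it just remembers the most recent reader. */
    static final class ToyTokenizer {
        private Reader input;

        void setReader(Reader r) { this.input = r; }      // last call wins

        String readAll() throws IOException {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[256];
            for (int n; (n = input.read(buf, 0, buf.length)) != -1; ) sb.append(buf, 0, n);
            return sb.toString();
        }
    }

    /** Stand-in for the analyzer base class and its fixed call order. */
    abstract static class ToyAnalyzer {
        protected abstract ToyTokenizer createComponents();

        protected Reader initReader(Reader reader) { return reader; } // default: no filtering

        final String analyze(String text) {
            ToyTokenizer source = createComponents();
            // The framework sets the reader last, replacing anything set earlier.
            source.setReader(initReader(new StringReader(text)));
            try {
                return source.readAll();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    /** Mirrors the question: the filter is attached in createComponents and then lost. */
    static String wrongWay(String html) {
        return new ToyAnalyzer() {
            @Override protected ToyTokenizer createComponents() {
                ToyTokenizer t = new ToyTokenizer();
                t.setReader(new TagStrippingReader(new StringReader(html)));
                return t;
            }
        }.analyze(html);
    }

    /** Mirrors the fix: the filter is attached in initReader, so it survives. */
    static String rightWay(String html) {
        return new ToyAnalyzer() {
            @Override protected ToyTokenizer createComponents() { return new ToyTokenizer(); }
            @Override protected Reader initReader(Reader reader) {
                return new TagStrippingReader(reader);
            }
        }.analyze(html);
    }
}
```

With the input `<b>Hello</b> World`, `wrongWay` returns the string unchanged, tags included, while `rightWay` returns `Hello World`.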
Answer 1 (score: 1)
I would suggest using CustomAnalyzer instead: https://lucene.apache.org/core/6_0_1/analyzers-common/org/apache/lucene/analysis/custom/CustomAnalyzer.html
(available since Lucene 5.x)