I'm working with Lucene 4.7 and trying to migrate one of the analyzers we use in our Solr configuration:
<analyzer>
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1"
generateNumberParts="1"
catenateWords="1"
catenateNumbers="1"
catenateAll="0"
splitOnCaseChange="0"
splitOnNumerics="0"
preserveOriginal="1"
/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
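However, I can't figure out how to use HTMLStripCharFilterFactory and WordDelimiterFilterFactory with the configuration above. Also, my query-time analyzer in Solr is as follows; how can I achieve the same thing in Lucene?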
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="stopwords.txt"
/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
Answer 0 (score: 5)
The Analysis package documentation explains how to use a CharFilter: you wrap the reader in an overridden initReader method.
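I assume your problem with WordDelimiterFilter is that you don't know how to set the configuration options you are using? Build an int to pass to the constructor by combining the appropriate constants with bitwise OR (|). Each constant is a separate bit flag, so combining them with AND (&) would give 0 and switch every option off. For example:
int config = WordDelimiterFilter.GENERATE_NUMBER_PARTS | WordDelimiterFilter.GENERATE_WORD_PARTS; //etc.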
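To make the bit-flag point concrete, here is a minimal sketch in plain Java with no Lucene dependency. The constants in FlagDemo are hypothetical stand-ins, not the library's own definitions; the only assumption that matters is that each one is a distinct single bit. Merging them with OR keeps every option, while AND of distinct single-bit values leaves nothing set.

```java
public class FlagDemo {
    // Hypothetical stand-ins for the WordDelimiterFilter option constants.
    // Assumption: each option occupies its own bit.
    static final int GENERATE_WORD_PARTS = 1;
    static final int GENERATE_NUMBER_PARTS = 2;
    static final int CATENATE_WORDS = 4;
    static final int CATENATE_NUMBERS = 8;
    static final int PRESERVE_ORIGINAL = 32;

    // Bitwise OR sets every listed bit in the result.
    static int combineWithOr() {
        return GENERATE_WORD_PARTS | GENERATE_NUMBER_PARTS
                | CATENATE_WORDS | CATENATE_NUMBERS | PRESERVE_ORIGINAL;
    }

    // Bitwise AND keeps only bits common to all operands, which is none here.
    static int combineWithAnd() {
        return GENERATE_WORD_PARTS & GENERATE_NUMBER_PARTS
                & CATENATE_WORDS & CATENATE_NUMBERS & PRESERVE_ORIGINAL;
    }

    // A flag is enabled when its bit survives masking against the config.
    static boolean isSet(int config, int flag) {
        return (config & flag) != 0;
    }

    public static void main(String[] args) {
        System.out.println("OR  -> " + combineWithOr());
        System.out.println("AND -> " + combineWithAnd());
        System.out.println("PRESERVE_ORIGINAL set after OR? "
                + isSet(combineWithOr(), PRESERVE_ORIGINAL));
    }
}
```

A filter that receives the AND result sees no options enabled at all, which is easy to miss because the code still compiles and runs.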
So, in the end, you would end up with something like:
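import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.charfilter.HTMLStripCharFilter;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;
import org.apache.lucene.analysis.util.StopwordAnalyzerBase;
import org.apache.lucene.util.Version;

// StopwordAnalyzerBase grants you some convenient ways to handle stop word sets.
public class MyAnalyzer extends StopwordAnalyzerBase {
    // Static, so that it can be referenced inside the super(...) call.
    private static final Version VERSION = Version.LUCENE_47;
    private final int wordDelimiterConfig;

    public MyAnalyzer() throws IOException {
        super(VERSION, loadStopwordSet(new FileReader("stopwords.txt"), VERSION));
        // Might as well load this config up front, along with the stop words.
        // Flags are combined with bitwise OR; AND would yield 0.
        // I believe the Solr factory also enables stemEnglishPossessive by default;
        // add STEM_ENGLISH_POSSESSIVE if you need to match that.
        wordDelimiterConfig =
            WordDelimiterFilter.GENERATE_WORD_PARTS
            | WordDelimiterFilter.GENERATE_NUMBER_PARTS
            | WordDelimiterFilter.CATENATE_WORDS
            | WordDelimiterFilter.CATENATE_NUMBERS
            | WordDelimiterFilter.PRESERVE_ORIGINAL;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new WhitespaceTokenizer(VERSION, reader);
        // null: no protected-words set
        TokenStream filter = new WordDelimiterFilter(source, wordDelimiterConfig, null);
        filter = new LowerCaseFilter(VERSION, filter);
        // stopwords is the field inherited from StopwordAnalyzerBase
        filter = new StopFilter(VERSION, filter, stopwords);
        filter = new PorterStemFilter(filter);
        return new TokenStreamComponents(source, filter);
    }

    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        // Strip HTML markup before the tokenizer sees the text.
        return new HTMLStripCharFilter(reader);
    }
}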
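Note: I have moved the StopFilter to after the LowerCaseFilter. This makes stop word matching case-insensitive, as long as your stop word definitions are all lowercase. I don't know whether that is a problem because of the WordDelimiterFilter. If it is, there is a loadStopwordSet method that supports case insensitivity, but frankly I don't know how to use it.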
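The ordering point in the note can be illustrated without Lucene. The sketch below is plain Java over a list of strings, not the real filter classes, and the stop word list in StopOrderDemo is hypothetical. It compares lowercasing before a case-sensitive stop word lookup with doing the lookup first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopOrderDemo {
    // Hypothetical stop word list, all lowercase as the note assumes.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "and", "of"));

    // Lowercase each token first, then drop stop words.
    static List<String> lowercaseThenStop(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(lower)) {
                out.add(lower);
            }
        }
        return out;
    }

    // Drop stop words with a case-sensitive lookup first, then lowercase.
    static List<String> stopThenLowercase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                out.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("The", "Lord", "of", "THE", "Rings");
        System.out.println(lowercaseThenStop(tokens));
        System.out.println(stopThenLowercase(tokens));
    }
}
```

With the case-sensitive lookup first, "The" and "THE" slip past the stop list and end up in the output as "the", which is the behaviour the reordering avoids.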