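I'm using Lucene 5.1.0, and I want my IndexWriter to index only terms that start with a capital letter. I've looked at custom analyzers and the pattern tokenizer, but I can't figure out how to use them to index only words that start with a capital letter (or that consist entirely of capital letters). Any help would be appreciated.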
Answer 0 (score: 4)
I found this link helpful for wrapping my head around custom tokenizers/analyzers/filters: http://www.citrine.io/blog/2015/2/14/building-a-custom-analyzer-in-lucene
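In your case, though, I think it's easier to extend org.apache.lucene.analysis.util.FilteringTokenFilter rather than TokenFilter, because you only have to implement accept() and the base class drops every token for which it returns false: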
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.util.FilteringTokenFilter;

    public class StartsWithCapitalTokenFilter extends FilteringTokenFilter {
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

        public StartsWithCapitalTokenFilter(TokenStream tokenStream) {
            super(tokenStream);
        }

        @Override
        protected boolean accept() {
            // When accept() is called, my understanding is that termAtt.buffer()
            // holds the current token's characters (as a char[]); only the first
            // termAtt.length() entries are valid. Reject empty terms, then check
            // whether the first Unicode code point is uppercase.
            if (termAtt.length() == 0) {
                return false;
            }
            return Character.isUpperCase(
                    Character.codePointAt(termAtt.buffer(), 0, termAtt.length()));
            // Or, if you don't care about code points above U+FFFF:
            // return Character.isUpperCase(termAtt.buffer()[0]);
        }
    }
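To see why the code-point check matters, and how the "all letters uppercase" variant from the question might look, here is a minimal standalone sketch that uses only the JDK. The class name `CapitalCheckDemo` and the helper names `startsWithCapital` and `isAllCaps` are made up for illustration; they are not part of Lucene. The helpers take a `char[]` plus a length to mirror what `termAtt.buffer()` and `termAtt.length()` would hand you inside `accept()`.

```java
public class CapitalCheckDemo {

    // True when the term is non-empty and its first code point is uppercase.
    static boolean startsWithCapital(char[] buffer, int length) {
        if (length == 0) {
            return false;
        }
        return Character.isUpperCase(Character.codePointAt(buffer, 0, length));
    }

    // True when the term is non-empty and every code point is uppercase.
    // Digits and punctuation are not uppercase, so "U.S." is rejected here.
    static boolean isAllCaps(char[] buffer, int length) {
        if (length == 0) {
            return false;
        }
        int i = 0;
        while (i < length) {
            int codePoint = Character.codePointAt(buffer, i, length);
            if (!Character.isUpperCase(codePoint)) {
                return false;
            }
            // Supplementary characters occupy two chars, so step accordingly.
            i += Character.charCount(codePoint);
        }
        return true;
    }

    public static void main(String[] args) {
        // "\uD801\uDC00" is U+10400 (DESERET CAPITAL LETTER LONG I), an
        // uppercase letter stored as a surrogate pair.
        String[] terms = { "Lucene", "lucene", "NASA", "\uD801\uDC00x" };
        for (String term : terms) {
            char[] buf = term.toCharArray();
            System.out.println(term
                    + " startsWithCapital=" + startsWithCapital(buf, buf.length)
                    + " isAllCaps=" + isAllCaps(buf, buf.length));
        }
        // The char-based shortcut only sees the high surrogate, which is
        // not an uppercase letter on its own.
        char[] deseret = "\uD801\uDC00x".toCharArray();
        System.out.println("char-based check: " + Character.isUpperCase(deseret[0]));
    }
}
```

If you wanted the all-caps behaviour in the filter, the body of `isAllCaps` could presumably be dropped into `accept()` with `termAtt.buffer()` and `termAtt.length()` as its arguments.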
Then you need a custom analyzer that applies the filter. This one uses only the new filter:
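    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    public class StartswithCapitalAnalyzer extends Analyzer {
        // In Lucene 5.x, createComponents takes only the field name, and
        // tokenizers are constructed without a Reader.
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer tokenizer = new StandardTokenizer();
            TokenStream filter = new StartsWithCapitalTokenFilter(tokenizer);
            // Chain any other filters you want after the capital check, e.g.
            // (needs org.apache.lucene.analysis.core.LowerCaseFilter):
            // filter = new LowerCaseFilter(filter);
            return new TokenStreamComponents(tokenizer, filter);
        }
    }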
This should all work, but I don't have an environment to test it in right now. Good luck!