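This is a rather unusual problem I ran into while trying to implement a custom stemming filter for Solr 4.x. After the stream passes through my custom filter, the last character/suffix of the first token generated gets appended to the subsequent tokens in the stream.
Please see the screenshot for reference.
Field type definition: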
<fieldType name="text_hi_cust" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_hi.txt"/>
    <filter class="com.rev.solr.utils.hindi.stemmer.HindiStemFilterFactory"/>
  </analyzer>
</fieldType>
Field definition:
<field name="loc_hi_2" type="text_hi_cust" indexed="true" stored="true"/>
Stemmer Filter Factory:
import java.util.Map;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.util.TokenFilterFactory;

public class HindiStemFilterFactory extends TokenFilterFactory {

    public HindiStemFilterFactory(Map<String, String> args) {
        super(args);
        // This filter takes no parameters, so reject anything left over.
        if (!args.isEmpty()) {
            throw new IllegalArgumentException("Unknown parameters: " + args);
        }
    }

    @Override
    public TokenStream create(TokenStream ts) {
        // Wrap the incoming stream with the stemming filter.
        return new HindiStemFilter(ts);
    }
}
Stemmer filter:
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.KeywordAttribute;

public final class HindiStemFilter extends TokenFilter {

    private final CharTermAttribute termAttr;
    private final KeywordAttribute keywordAttr;
    private final HindiStemmer stemmer;

    protected HindiStemFilter(TokenStream input) {
        super(input);
        termAttr = addAttribute(CharTermAttribute.class);
        keywordAttr = addAttribute(KeywordAttribute.class);
        stemmer = new HindiStemmer();
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (input.incrementToken()) {
            // Tokens marked as keywords are left unstemmed.
            if (!keywordAttr.isKeyword()) {
                // Truncate the term to the root length reported by the stemmer.
                termAttr.setLength(stemmer.stem(termAttr.buffer(), termAttr.length()));
            }
            return true;
        } else {
            return false;
        }
    }
}
Hindi Stemmer
public int stem(char buffer[], int len) throws IOException {
    loadDictionaries();
    String input = new String(buffer);
    int rootLen = getRootlength(input.trim()); // Returns the length of the root word.
    return rootLen;
}
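Any pointers would be greatly appreciated. Thanks!
Answer 0 (score: 0)
I finally figured out the problem and a solution for it. It may not be the best one, but it works.
Problem observed: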
public int stem(char buffer[], int len) throws IOException {
    String input = new String(buffer);
    int rootLen = getRootlength(input.trim());
    // Copy the root characters back into the term buffer.
    for (int i = 0; i < rootLen; i++) {
        buffer[i] = input.charAt(i);
    }
    return rootLen;
}
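A possible explanation for the symptom, offered as an assumption rather than a confirmed diagnosis: the array returned by CharTermAttribute.buffer() is reused between tokens and can be longer than length(), so only the first len characters belong to the current token. Building a String from the whole array can therefore pick up characters left behind by an earlier, longer token. The sketch below reproduces this without Lucene; StaleBufferDemo, rootLength (a stand-in for getRootlength that just strips a trailing "s") and the sample tokens are all hypothetical.

```java
public class StaleBufferDemo {

    // Hypothetical stand-in for getRootlength(): drops one trailing "s".
    static int rootLength(String word) {
        return word.endsWith("s") ? word.length() - 1 : word.length();
    }

    // Mirrors the approach in the question: the String is built from the
    // WHOLE buffer, so leftovers from a previous token are included.
    static String stemWholeBuffer(char[] buffer, int len) {
        String input = new String(buffer);
        int rootLen = rootLength(input.trim());
        return new String(buffer, 0, rootLen);
    }

    // Alternative: read only the first len characters, which are the
    // ones that belong to the current token.
    static String stemValidRegion(char[] buffer, int len) {
        String input = new String(buffer, 0, len);
        int rootLen = rootLength(input);
        return new String(buffer, 0, rootLen);
    }

    public static void main(String[] args) {
        // One shared buffer, reused for every token.
        char[] buffer = new char[16];

        // First token: "books" (length 5).
        "books".getChars(0, 5, buffer, 0);
        System.out.println(stemValidRegion(buffer, 5)); // book

        // Second token: "cat" (length 3) overwrites only the first three
        // characters; "ks" from the first token is still in the array.
        "cat".getChars(0, 3, buffer, 0);
        System.out.println(stemWholeBuffer(buffer, 3)); // catk
        System.out.println(stemValidRegion(buffer, 3)); // cat
    }
}
```

If this is indeed the cause, constructing the String with new String(buffer, 0, len) inside stem may be enough on its own; that has not been tested against the original HindiStemmer.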
Hope this helps!