Here is my basic code:
    public static final Pattern DIACRITICS_AND_FRIENDS
        = Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");

    private static String stripDiacritics(String str) {
        str = Normalizer.normalize(str, Normalizer.Form.NFD);
        str = DIACRITICS_AND_FRIENDS.matcher(str).replaceAll("");
        return str;
    }
But how do I put this into a TokenFilter? I have used NormalizeCharMap before, but that only works for replacing fixed string literals. I am using Lucene 4.
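For reference, here is a minimal JDK-only sketch of what the helper does on sample input. The class name `StripDiacriticsDemo` and the sample words are made up for illustration; the pattern and method body are the ones from the question.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical demo class: wraps the question's helper so it can be run on its own.
class StripDiacriticsDemo {
    static final Pattern DIACRITICS_AND_FRIENDS
        = Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");

    static String stripDiacritics(String str) {
        // NFD splits a precomposed letter into base letter + combining mark(s)
        str = Normalizer.normalize(str, Normalizer.Form.NFD);
        // then the combining marks (and Lm/Sk characters) are removed
        return DIACRITICS_AND_FRIENDS.matcher(str).replaceAll("");
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source file ASCII-only
        System.out.println(stripDiacritics("caf\u00e9"));          // e with acute accent
        System.out.println(stripDiacritics("\u00c5ngstr\u00f6m")); // A with ring, o with diaeresis
    }
}
```

Note that `\p{IsSk}` also matches some plain ASCII characters such as `^` and the backtick, so those are stripped as well.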
Answer 0 (score: 0)
You need to override the incrementToken() method and update the CharTermAttribute inside it:
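    import java.io.IOException;
    import java.text.Normalizer;
    import java.util.regex.Pattern;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public final class DiacriticFilter extends TokenFilter {
        public static final Pattern DIACRITICS_AND_FRIENDS
            = Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

        // TokenFilter has no default constructor, so the wrapped stream must be passed up
        public DiacriticFilter(TokenStream input) {
            super(input);
        }

        @Override
        public final boolean incrementToken() throws IOException {
            if (input.incrementToken()) {
                // Only the first length() chars of buffer() belong to the current term
                String result = stripDiacritics(new String(termAtt.buffer(), 0, termAtt.length()));
                char[] newBuffer = result.toCharArray();
                // copyBuffer also sets the term length
                termAtt.copyBuffer(newBuffer, 0, newBuffer.length);
                return true;
            } else {
                return false;
            }
        }

        private static String stripDiacritics(String str) {
            str = Normalizer.normalize(str, Normalizer.Form.NFD);
            str = DIACRITICS_AND_FRIENDS.matcher(str).replaceAll("");
            return str;
        }
    }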
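One detail worth illustrating: `CharTermAttribute.buffer()` returns the internal backing array, which can be longer than the current term and may still hold characters from earlier tokens, so only the first `length()` characters are valid. The sketch below simulates this with a plain `char[]` and no Lucene dependency; `BufferLengthDemo`, `stripTerm`, and the sample strings are hypothetical names chosen for the example.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical demo: a plain char[] stands in for the attribute's reused buffer.
class BufferLengthDemo {
    static final Pattern DIACRITICS_AND_FRIENDS
        = Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");

    static String stripDiacritics(String str) {
        str = Normalizer.normalize(str, Normalizer.Form.NFD);
        return DIACRITICS_AND_FRIENDS.matcher(str).replaceAll("");
    }

    // Reads only the valid prefix buffer[0..length), as a filter should
    static String stripTerm(char[] buffer, int length) {
        return stripDiacritics(new String(buffer, 0, length));
    }

    public static void main(String[] args) {
        char[] buffer = new char[16];

        // A previous, longer token leaves its characters in the buffer
        String previous = "r\u00e9sum\u00e9-long";
        previous.getChars(0, previous.length(), buffer, 0);

        // The current token overwrites only its own prefix
        String current = "caf\u00e9";
        current.getChars(0, current.length(), buffer, 0);

        // Correct: respects the term length
        System.out.println(stripTerm(buffer, current.length()));
        // Incorrect: converts the whole array, including stale and unused chars
        System.out.println(stripDiacritics(new String(buffer)));
    }
}
```

Reading the whole array would produce a term polluted with leftover characters, which is why the term text should be taken from the first `length()` characters (or from `termAtt.toString()`).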