How to write a Lucene TokenFilter to normalize text

Time: 2014-09-28 19:59:44

Tags: java lucene

So I have my basic code:

public static final Pattern DIACRITICS_AND_FRIENDS
        = Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");


private static String stripDiacritics(String str) {
    str = Normalizer.normalize(str, Normalizer.Form.NFD);
    str = DIACRITICS_AND_FRIENDS.matcher(str).replaceAll("");
    return str;
}

But how do I put this into a TokenFilter? Previously I used NormalizeCharMap, but that only works for replacing string literals. I am using Lucene 4.
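For reference, the normalization step can be exercised on its own before it is wired into Lucene. The following is a minimal sketch that uses only the JDK; the class name `DiacriticsDemo` and the sample words are assumptions made for illustration, while the pattern and the method body are the ones from the question.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

final class DiacriticsDemo {
    // Same pattern as in the question: combining marks, modifier letters (Lm), modifier symbols (Sk).
    static final Pattern DIACRITICS_AND_FRIENDS
            = Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");

    static String stripDiacritics(String str) {
        // NFD splits an accented letter into its base letter plus combining marks.
        str = Normalizer.normalize(str, Normalizer.Form.NFD);
        // The combining marks are then removed, leaving only the base letters.
        return DIACRITICS_AND_FRIENDS.matcher(str).replaceAll("");
    }

    public static void main(String[] args) {
        // Sample inputs (hypothetical), written with escapes to avoid source-encoding issues.
        System.out.println(stripDiacritics("caf\u00e9"));
        System.out.println(stripDiacritics("\u00c5ngstr\u00f6m"));
    }
}
```

Note that the Sk category also covers standalone characters such as `^` and the backtick, so those would be removed from the text as well.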

1 answer:

Answer 0: (score: 0)

You need to override the incrementToken() method and update the CharTermAttribute inside it:

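import java.io.IOException;
import java.text.Normalizer;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class DiacriticFilter extends TokenFilter {
    // Pattern from the question, repeated here so the class compiles on its own.
    private static final Pattern DIACRITICS_AND_FRIENDS
            = Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    // TokenFilter has no no-arg constructor, so the wrapped stream must be passed up.
    public DiacriticFilter(TokenStream input) {
        super(input);
    }

    @Override
    public final boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        // buffer() can be longer than the term; toString() reads only the first length() chars.
        String result = stripDiacritics(termAtt.toString());
        // Replace the term text; append() also updates the term length.
        termAtt.setEmpty().append(result);
        return true;
    }

    private static String stripDiacritics(String str) {
        str = Normalizer.normalize(str, Normalizer.Form.NFD);
        str = DIACRITICS_AND_FRIENDS.matcher(str).replaceAll("");
        return str;
    }
}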
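
One detail to watch when reading the term text inside incrementToken(): CharTermAttribute.buffer() exposes the attribute's internal array, which is reused between tokens and can be longer than the current term, so only the first length() characters are valid. The sketch below simulates that behaviour with a plain char array and no Lucene dependency; `TermBufferDemo` and its methods are hypothetical stand-ins, not Lucene API.

```java
final class TermBufferDemo {
    // Hypothetical stand-in for a reusable term buffer: a fixed array plus a valid length.
    private final char[] buffer = new char[8];
    private int length;

    // Overwrites the start of the buffer; older characters beyond the new length stay in place.
    // For simplicity this sketch assumes the term fits into the 8-char buffer.
    void setTerm(String term) {
        term.getChars(0, term.length(), buffer, 0);
        length = term.length();
    }

    // Mirrors new String(buffer()): includes stale and unused characters.
    String wholeBuffer() {
        return new String(buffer);
    }

    // Mirrors new String(buffer(), 0, length()): only the valid part of the term.
    String validPart() {
        return new String(buffer, 0, length);
    }

    public static void main(String[] args) {
        TermBufferDemo term = new TermBufferDemo();
        term.setTerm("lucene");
        term.setTerm("java"); // shorter token reuses the same array
        System.out.println(term.validPart());
        System.out.println(term.wholeBuffer().length());
    }
}
```

In a real filter the same effect is obtained by reading the term with termAtt.toString(), or with new String(termAtt.buffer(), 0, termAtt.length()).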