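How to normalize all special characters except umlauts?

Time: 2014-07-03 08:00:51

Tags: java unicode normalization matcher unicode-normalization

I want to normalize all extended ASCII characters, but leave umlauts untouched.

If I wanted to strip umlauts as well, I would use: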

Normalizer.normalize(value, Normalizer.Form.NFKD)
    .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
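To make the problem concrete, here is a minimal, self-contained sketch of what that snippet does to the sample input. The class name `StripAllMarks` and the method name `stripAllMarks` are made up for illustration; only `java.text.Normalizer` and `String.replaceAll` come from the snippet above.

```java
import java.text.Normalizer;

public class StripAllMarks {
    // Decomposes the input (NFKD), then deletes every character from the
    // Combining Diacritical Marks block (U+0300..U+036F).
    static String stripAllMarks(String value) {
        return Normalizer.normalize(value, Normalizer.Form.NFKD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        // The sample string from the question, written with Unicode escapes
        // so the file compiles regardless of source encoding.
        String source = "\u00fc\u00f6\u00e4\u00e2\u00c7\u00e6\u00f4\u00f8\u00f1\u00c1";
        System.out.println(stripAllMarks(source));
    }
}
```

On a UTF-8 console this should print uoaaCæoønA: the umlauts lose their dots along with every other accent, while æ and ø survive because they have no Unicode decomposition.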

But how can I exclude the German umlauts?

As a result I would like to get:

Source: üöäâÇæôøñÁ

Expected result: üöäaCaeoonA or similar

2 answers:

Answer 0: (score: 1)

From here I see two solutions: the first is quite dirty, and the second is, I guess, hard to implement.

Answer 1: (score: 1)

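import java.text.Normalizer;
import java.util.HashMap;

// Latin to ASCII - mostly
// One entry per code point from U+00C0 to U+017F; umlauts and ß map to themselves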
private static final String TAB_00C0 = "" +
        "AAAAÄAACEEEEIIII" +
        "DNOOOOÖ×OUUUÜYTß" +
        "aaaaäaaceeeeiiii" +
        "dnooooö÷ouuuüyty" +
        "AaAaAaCcCcCcCcDd" +
        "DdEeEeEeEeEeGgGg" +
        "GgGgHhHhIiIiIiIi" +
        "IiJjJjKkkLlLlLlL" +
        "lLlNnNnNnnNnOoOo" +
        "OoOoRrRrRrSsSsSs" +
        "SsTtTtTtUuUuUuUu" +
        "UuUuWwYyYZzZzZzs";

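// Letters that expand to two ASCII characters; keys must be char literals
private static final HashMap<Character, String> LIGATURES = new HashMap<Character, String>() {{
    put('æ', "ae");
    put('œ', "oe");
    put('þ', "th");
    put('ĳ', "ij"); // U+0133, the single-character ij ligature
    put('ð', "dh");
    put('Æ', "AE");
    put('Œ', "OE");
    put('Þ', "TH");
    put('Ð', "DH");
    put('Ĳ', "IJ"); // U+0132, the single-character IJ ligature
    //TODO
}};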

public static String removeAllButUmlauts(String value) {
    // Compose first so that e.g. "u" + U+0308 becomes the single char 'ü'
    value = Normalizer.normalize(value, Normalizer.Form.NFC);
    StringBuilder sb = new StringBuilder(value.length());
    for (int i = 0; i < value.length(); i++) {
        char c = value.charAt(i);
        String l = LIGATURES.get(c);
        if (l != null) {
            sb.append(l); // ligature expands to two ASCII letters
        } else if (c < 0xc0) {
            sb.append(c); // ASCII, C1 control codes and Latin-1 symbols
        } else if (c <= 0x17f) {
            sb.append(TAB_00C0.charAt(c - 0xc0)); // common single Latin letters
        } else {
            // anything else, including Vietnamese and rare diacritics
            l = Normalizer.normalize(Character.toString(c), Normalizer.Form.NFKD)
                    .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
            sb.append(l);
        }
    }
    return sb.toString();
}

Then

String value = "üöäâÇæôøñÁ";
String after = removeAllButUmlauts(value);
System.out.println(after);

gives:

üöäaCaeoonA
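For comparison, a table-free variant is also possible. The following is only a rough sketch of a different technique, not the method used in this answer: decompose with NFD, keep a combining diaeresis (U+0308) only when it directly follows a, o or u, drop all other combining marks, then recompose with NFC. The class name `KeepUmlauts` and the method name `stripMarksExceptUmlauts` are hypothetical, and unlike the table approach this sketch does not touch letters without a decomposition, such as æ or ø.

```java
import java.text.Normalizer;

public class KeepUmlauts {
    // Hypothetical helper: removes combining marks except the diaeresis on a/o/u.
    static String stripMarksExceptUmlauts(String value) {
        String decomposed = Normalizer.normalize(value, Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder(decomposed.length());
        char base = 0;                 // last non-mark character seen
        boolean directlyAfterBase = false;
        for (int i = 0; i < decomposed.length(); i++) {
            char c = decomposed.charAt(i);
            boolean isMark = c >= 0x0300 && c <= 0x036F; // Combining Diacritical Marks block
            if (!isMark) {
                sb.append(c);
                base = c;
                directlyAfterBase = true;
            } else {
                // Keep only a diaeresis that immediately follows a, o or u
                if (c == 0x0308 && directlyAfterBase && "aouAOU".indexOf(base) >= 0) {
                    sb.append(c);
                }
                directlyAfterBase = false;
            }
        }
        // Recompose so that "u" + U+0308 becomes the single character U+00FC again
        return Normalizer.normalize(sb.toString(), Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // The question's sample string, written with Unicode escapes
        String source = "\u00fc\u00f6\u00e4\u00e2\u00c7\u00e6\u00f4\u00f8\u00f1\u00c1";
        System.out.println(stripMarksExceptUmlauts(source));
    }
}
```

For the sample string this should yield üöäaCæoønA, so a ligature map like the one above would still be needed to get from æ and ø to "ae" and "o".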