I want to normalize any extended ASCII characters, but leave umlauts alone.
If I wanted to normalize the umlauts as well, I would go with:
Normalizer.normalize(value, Normalizer.Form.NFKD)
.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
But how can I exclude the German umlauts?
This is the result I would like to get:
Source: üöäâÇæôøñÁ
Desired result: üöäaCaeoonA
or something similar.
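For reference, here is a minimal sketch of what the snippet above does to the sample input. The class and method names (`StripAllMarksSketch`, `stripAllMarks`) are made up for illustration. It shows that the umlauts lose their dots along with everything else, and that `æ` and `ø` survive because they have no Unicode decomposition.

```java
import java.text.Normalizer;

// Hypothetical wrapper around the snippet from the question.
class StripAllMarksSketch {

    // Decompose (NFKD), then delete every character in the
    // "Combining Diacritical Marks" block (U+0300 to U+036F).
    static String stripAllMarks(String value) {
        return Normalizer.normalize(value, Normalizer.Form.NFKD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        // Same sample as in the question: üöäâÇæôøñÁ
        String source = "\u00FC\u00F6\u00E4\u00E2\u00C7\u00E6\u00F4\u00F8\u00F1\u00C1";
        System.out.println(stripAllMarks(source)); // uoaaCæoønA
    }
}
```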
Answer 0 (score: 1)
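I see two solutions here; the first is rather dirty and the second is, I guess, tedious to implement.
1. Remove the umlaut characters from the string before normalizing it, then put them back after normalization.
2. Don't use the pre-built pattern \p{InCombiningDiacriticalMarks}. Instead, build your own pattern that excludes the umlaut mark.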
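The two ideas above might be sketched roughly as follows. This is only one possible reading of them, not the answerer's code: every name (`KeepUmlautsSketch`, `protectAndRestore`, `customPattern`), the choice of private-use placeholders starting at U+E000, and the exact regular expression are assumptions made for illustration.

```java
import java.text.Normalizer;

// Hypothetical names throughout; a sketch under the assumptions stated above.
class KeepUmlautsSketch {

    // The six German umlauts, written as escapes: ä ö ü Ä Ö Ü
    private static final String UMLAUTS = "\u00E4\u00F6\u00FC\u00C4\u00D6\u00DC";

    // Assumption: the input never contains private-use characters U+E000..U+E005.
    private static final char PLACEHOLDER_BASE = '\uE000';

    // Idea 1: swap umlauts for placeholders, strip all marks, swap them back.
    static String protectAndRestore(String value) {
        // Compose first so that "u" + U+0308 is seen as a single umlaut character.
        String s = Normalizer.normalize(value, Normalizer.Form.NFC);
        for (int i = 0; i < UMLAUTS.length(); i++) {
            s = s.replace(UMLAUTS.charAt(i), (char) (PLACEHOLDER_BASE + i));
        }
        s = Normalizer.normalize(s, Normalizer.Form.NFKD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
        for (int i = 0; i < UMLAUTS.length(); i++) {
            s = s.replace((char) (PLACEHOLDER_BASE + i), UMLAUTS.charAt(i));
        }
        return s;
    }

    // Idea 2: a custom pattern. It matches a diaeresis (U+0308) that does NOT
    // directly follow a/o/u, or any other mark in the combining block.
    private static final String MARKS_EXCEPT_UMLAUT =
            "(?<![aouAOU])\\u0308|[\\p{InCombiningDiacriticalMarks}&&[^\\u0308]]";

    static String customPattern(String value) {
        String stripped = Normalizer.normalize(value, Normalizer.Form.NFKD)
                .replaceAll(MARKS_EXCEPT_UMLAUT, "");
        // Recompose so the surviving "u" + U+0308 pairs become single characters.
        return Normalizer.normalize(stripped, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // Same sample as in the question: üöäâÇæôøñÁ
        String source = "\u00FC\u00F6\u00E4\u00E2\u00C7\u00E6\u00F4\u00F8\u00F1\u00C1";
        System.out.println(protectAndRestore(source)); // üöäaCæoønA
        System.out.println(customPattern(source));     // üöäaCæoønA
    }
}
```

Neither variant touches `æ` or `ø`, since those characters have no decomposition; getting `ae` and `o` for them needs an explicit mapping on top.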
Take a look at:
Answer 1 (score: 1)
import java.text.Normalizer;
import java.util.HashMap;

// Latin to ASCII - mostly.
// One entry per code point from U+00C0 to U+017F; umlauts and ß map to themselves.
private static final String TAB_00C0 = "" +
    "AAAAÄAACEEEEIIII" +
    "DNOOOOÖ×OUUUÜYTß" +
    "aaaaäaaceeeeiiii" +
    "dnooooö÷ouuuüyty" +
    "AaAaAaCcCcCcCcDd" +
    "DdEeEeEeEeEeGgGg" +
    "GgGgHhHhIiIiIiIi" +
    "IiJjJjKkkLlLlLlL" +
    "lLlNnNnNnnNnOoOo" +
    "OoOoRrRrRrSsSsSs" +
    "SsTtTtTtUuUuUuUu" +
    "UuUuWwYyYZzZzZzs";
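// Letters that expand to two ASCII characters; looked up before the table.
private static final HashMap<Character, String> LIGATURES = new HashMap<Character, String>() {{
    put('æ', "ae");
    put('œ', "oe");
    put('þ', "th");
    put('\u0133', "ij"); // ĳ, LATIN SMALL LIGATURE IJ
    put('ð', "dh");
    put('Æ', "AE");
    put('Œ', "OE");
    put('Þ', "TH");
    put('Ð', "DH");
    put('\u0132', "IJ"); // Ĳ, LATIN CAPITAL LIGATURE IJ
    //TODO
}};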
public static String removeAllButUmlauts(String value) {
    // Compose first so that e.g. "u" + U+0308 becomes a single 'ü' the table can look up.
    String source = Normalizer.normalize(value, Normalizer.Form.NFC);
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < source.length(); i++) {
        char c = source.charAt(i);
        String l = LIGATURES.get(c);
        if (l != null) {
            sb.append(l);
        } else if (c < 0xc0) {
            sb.append(c); // ASCII, C1 control codes and Latin-1 symbols
        } else if (c <= 0x17f) {
            sb.append(TAB_00C0.charAt(c - 0xc0)); // common single Latin letters
        } else {
            // anything else, including Vietnamese and rare diacritics
            l = Normalizer.normalize(Character.toString(c), Normalizer.Form.NFKD)
                    .replaceAll("[\\p{InCombiningDiacriticalMarks}]+", "");
            sb.append(l);
        }
    }
    return sb.toString();
}
Then
String value = "üöäâÇæôøñÁ";
String after = removeAllButUmlauts(value);
System.out.println(after);
gives:
üöäaCaeoonA
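One detail that may be worth spelling out is the initial NFC step: the table lookup works on single precomposed characters, so input that arrives in decomposed form has to be composed first. A minimal sketch of that effect, with made-up names (`NfcStepSketch`, `compose`):

```java
import java.text.Normalizer;

// Hypothetical helper, only to illustrate the NFC step.
class NfcStepSketch {

    static String compose(String value) {
        return Normalizer.normalize(value, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // "u" followed by U+0308 COMBINING DIAERESIS: two chars, rendered as one glyph.
        String decomposed = "u\u0308";
        String composed = compose(decomposed);

        System.out.println(decomposed.length());       // 2
        System.out.println(composed.length());         // 1
        System.out.println(composed.charAt(0) == 0xFC); // true: U+00FC, inside the table range
    }
}
```

Without that step, the loop would see a plain `u` followed by a separate combining mark, and the mark would fall through to the last branch and be stripped.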