Space in character class differs from \x20 in ICU4J Transliterator

Date: 2015-03-18 19:54:24

Tags: java regex icu

I want to modify the ICU4J Cyrillic-to-Latin transliteration so that it preserves spaces. The obvious approach is:

@Test
public void test1() {
    // Despite the variable name, the sample text is Cyrillic ("математика"), not Greek.
    String greek
            = "'E\u00E9 \u043c\u0430\u0442\u0435\u043c\u0430\u0442\u0438\u043a\u0430";
    String id1 = "Any-Latin; NFD; [^\\p{Alnum} ] Remove";
    String id2 = "Any-Latin; NFD";
    String latin1 = com.ibm.icu.text.Transliterator.getInstance(id1)
            .transform(greek);
    Assert.assertEquals("Ee matematika", latin1);
}

But this fails (using ICU4J 54.1.1):

junit.framework.ComparisonFailure: expected:<Ee[ ]matematika> but was:<Ee[]matematika> at junit.framework.Assert.assertEquals

I can use the same regular expression with replaceAll in plain Java code, and that does work:

@Test
public void test2() {
    String greek
            = "'E\u00E9 \u043c\u0430\u0442\u0435\u043c\u0430\u0442\u0438\u043a\u0430";
    String id1 = "Any-Latin; NFD; [^\\p{Alnum} ] Remove";
    String id2 = "Any-Latin; NFD";
    String latin1 = com.ibm.icu.text.Transliterator.getInstance(id1)
            .transform(greek);
    Assert.assertEquals("Eematematika", latin1); // why not "Ee matematika"?
    String latin2 = com.ibm.icu.text.Transliterator.getInstance(id2)
            .transform(greek).replaceAll("[^\\p{Alnum} ]", "");
    Assert.assertEquals("Ee matematika", latin2);
}

Replacing the space in the transliterator ID with \\x20 also works. Is this just a bug in ICU4J, or is it somehow expected?

1 Answer:

Answer 0: (score: 0)

It might be the toString() of the ReplaceableString that transform() produces:

public String transform(String source) {
    return transliterate(source);
}
...
public final String transliterate(String text) {
    ReplaceableString result = new ReplaceableString(text);
    transliterate(result);
    return result.toString();
}

Try converting the strings you get to UTF-16 code points and check whether there is a difference.
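
One way to do that check is to print every code point of both strings in U+XXXX form, so that spaces and otherwise invisible characters (such as the combining marks NFD produces) show up explicitly. This is only a minimal sketch using the standard library: the class name CodePointDump and the method dump are made up for illustration, and the "actual" string is a stand-in for whatever the transliterator returns.

```java
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not part of ICU4J.
class CodePointDump {

    /** Formats each code point of s as U+XXXX, separated by single spaces. */
    static String dump(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String expected = "Ee matematika";
        // Stand-in value; in the real test this would be the transform() result.
        String actual = "Eematematika";

        System.out.println("expected: " + dump(expected));
        System.out.println("actual:   " + dump(actual));
    }
}
```

Comparing the two printed lines side by side shows exactly which code point, for example U+0020, is present in one string and missing from the other.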