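How do I combine transliteration rules in ICU4J?

Asked: 2016-01-12 09:35:13

Tags: java icu icu4j

I am using ICU4J and trying to combine transliteration rules. In the end I need to convert all German umlaut characters to their DIN 5007-2 replacements, and all other non-ASCII characters to their ASCII equivalents.

When I try this: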

import com.ibm.icu.text.Transliterator;

public class Main
{
    public static void main(String[] args)
    {
        Transliterator latinASCII = Transliterator.getInstance("Latin-ASCII");
        String german_DIN_5007_2Rules ="$beforeLower = [[:Mn:][:Me:]]* [:Lowercase:];\n" +
            "\\u00e4 > ae;\n" +
            "\\u00f6 > oe;\n" +
            "\\u00fc > ue;\n" +
            "\\u00c4 } $beforeLower > Ae;\n" +
            "\\u00d6 } $beforeLower > Oe;\n" +
            "\\u00dc } $beforeLower > Ue;\n" +
            "\\u00c4 > AE;\n" +
            "\\u00d6 > OE;\n" +
            "\\u00dc > UE;\n";
            //"\\u00df > ss;\n";

        String latinASCIIRules = latinASCII.toRules(true);

        String germanASCIIRules = latinASCIIRules + german_DIN_5007_2Rules;

        Transliterator germanASCII = Transliterator.createFromRules("german_DIN_5007_2", germanASCIIRules, Transliterator.FORWARD);

        String result1 = germanASCII.transliterate("Häuser Bäume Höfe Gärten daß Ü ü ö ä Ä Ö ß");
        String result2 = germanASCII.transliterate("Ç,ü,é,â,ä,à,ç,ê,ë,è,ï,î,ì,Ä,Å,É,æ,Æ,ô,ö,ò,û,ù,Ô,Û,Ã,ã,Ñ,Õ,õ,Ä,Ë,Ï,Ö,Ü,Ÿ,Ç,Œ,œ,ū,Ð,ð,Ċ,ċ,Ġ,ġ,ů,Ů,š,Š,Ě,ť,ž,Ć,Ł,Ó,Ź,ą,ę,ń,ś,ż,ÿ,Ö,Ü,á,í,ó,ú,ñ,Ñ,À,È,Ì,Ò,Ù,Á,É,Í,Ó,Ú,Ý,Â,Ê,Î,ß,Ø,ø,Å,å,Þ,þ,Ā,Ē,Ī,Ō,Ū,ā,ē,ī,ō,ě,Ů,ů,Č,č,Ď,ď,Ľ,ľ,Ň,ň,Ř,ř,Š,š,Ť,Ž,Ą,Ę,Ń,Ś,Ż,ć,ł,ó,ź, ,/");

        System.out.println(result1);
        System.out.println(result2);
    }
}  

I get:

Hauser Baume Hofe Garten dass U u o a A O ss     
C,u,e,a,a,a,c,e,e,e,i,i,i,A,A,E,ae,AE,o,o,o,u,u,O,U,A,a,N,O,o,A,E,I,O,U,Y,C,OE,oe,u,D,d,C,c,G,g,u,U,s,S,E,t,z,C,L,O,Z,a,e,n,s,z,y,O,U,a,i,o,u,n,N,A,E,I,O,U,A,E,I,O,U,Y,A,E,I,ss,O,o,A,a,TH,th,A,E,I,O,U,a,e,i,o,e,U,u,C,c,D,d,L,l,N,n,R,r,S,s,T,Z,A,E,N,S,Z,c,l,o,z, ,/

This is incorrect, because the German umlauts are not converted the way I need:

ä → ae
ö → oe
ü → ue
Ä → Ae
Ö → Oe
Ü → Ue

If I reverse the order in which germanASCIIRules is built, like this:

String germanASCIIRules = german_DIN_5007_2Rules + latinASCIIRules;

I get:

Exception in thread "main" com.ibm.icu.impl.IllegalIcuArgumentException: Compound filters misplaced
at com.ibm.icu.text.TransliteratorParser.parseRules(TransliteratorParser.java:1101)
at com.ibm.icu.text.TransliteratorParser.parse(TransliteratorParser.java:867)
at com.ibm.icu.text.Transliterator.createFromRules(Transliterator.java:1413)
at com.stepstone.Main.main(Main.java:26)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)

If I do not merge the rules and use only the german_DIN_5007_2 rules, like this:

 Transliterator germanASCII = Transliterator.createFromRules("german_DIN_5007_2", german_DIN_5007_2Rules, Transliterator.FORWARD);

I get:

Haeuser Baeume Hoefe Gaerten daß UE ue oe ae AE OE ß

Ç,ue,é,â,ae,?,ç,?,ë,?,?,î,?,AE,?,É,?,?,ô,oe,?,?,?,Ô,?,?,?,?,?,?,AE,Ë,?,OE,UE,?,Ç,?,?,?,?,?,?,?,?,?,ů,Ů,š,Š,Ě,ť,ž,Ć,Ł,Ó,Ź,ą,ę,ń,ś,ż,?,OE,UE,á,í,ó,ú,?,?,?,?,?,?,?,Á,É,Í,Ó,Ú,Ý,Â,?,Î,ß,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,ě,Ů,ů,Č,č,Ď,ď,Ľ,ľ,Ň,ň,Ř,ř,Š,š,Ť,Ž,Ą,Ę,Ń,Ś,Ż,ć,ł,ó,ź, ,/

Here the umlauts are transliterated correctly, but all the remaining characters are messed up :(

1 answer:

Answer 0 (score: 1)

This worked for me:

import com.ibm.icu.text.Transliterator;

public class TransliteratorWrapper {

    private final Transliterator germanASCII;

    public TransliteratorWrapper()
    {
        Transliterator latinASCII = Transliterator.getInstance("Latin-ASCII");
        // DIN 5007-2 umlaut replacements: an uppercase umlaut followed by a
        // lowercase letter becomes title case (Ae), otherwise upper case (AE).
        String german_DIN_5007_2Rules = "$beforeLower = [[:Mn:][:Me:]]* [:Lowercase:];\n" +
                "\\u00e4 > ae;\n" +
                "\\u00f6 > oe;\n" +
                "\\u00fc > ue;\n" +
                "\\u00c4 } $beforeLower > Ae;\n" +
                "\\u00d6 } $beforeLower > Oe;\n" +
                "\\u00dc } $beforeLower > Ue;\n" +
                "\\u00c4 > AE;\n" +
                "\\u00d6 > OE;\n" +
                "\\u00dc > UE;\n";

        String latinASCIIRules = latinASCII.toRules(true);

        // Insert the German rules ahead of the ::NFD() step of the Latin-ASCII
        // rules, so the umlauts are replaced before decomposition.
        String germanASCIIRules = latinASCIIRules.replace("::NFD();", german_DIN_5007_2Rules + "\n::NFD();");

        germanASCII = Transliterator.createFromRules("german_DIN_5007_2", germanASCIIRules, Transliterator.FORWARD);
    }

    public String transliterate(String text)
    {
        return transliterateGerman(text);
    }

    public String transliterateGerman(String text)
    {
        return germanASCII.transliterate(text);
    }
}
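
A likely reason the position matters (my reading of the behaviour, not something I have checked against the ICU source): the rules generated from Latin-ASCII appear to start with a global filter that must stay at the very beginning, which would explain the "Compound filters misplaced" error when the German rules are prepended. The ::NFD() pass that follows decomposes the umlauts so that their diacritics are stripped before any rules appended at the end get a chance to run. Splicing the German rules in just before ::NFD() avoids both problems.

One caveat is that String.replace returns the input unchanged when the marker is not found, so if the text produced by toRules(true) ever differs, the German rules would be dropped silently. Below is a minimal sketch of a stricter splice in plain Java. RuleSplicer and insertBefore are hypothetical names, not part of ICU4J, and the base rules in main are a shortened stand-in rather than the real Latin-ASCII rule text.

```java
/** Hypothetical helper (not part of ICU4J): splices extra rules in front of a marker. */
final class RuleSplicer {

    private RuleSplicer() {
    }

    /**
     * Inserts extraRules, followed by a newline, immediately before the first
     * occurrence of marker in baseRules. Throws if the marker is missing
     * instead of returning baseRules unchanged, as String.replace would.
     */
    static String insertBefore(String baseRules, String marker, String extraRules) {
        int index = baseRules.indexOf(marker);
        if (index < 0) {
            throw new IllegalArgumentException("Marker not found in rules: " + marker);
        }
        return baseRules.substring(0, index) + extraRules + "\n" + baseRules.substring(index);
    }

    public static void main(String[] args) {
        // Shortened stand-in for latinASCII.toRules(true); an assumption made
        // for illustration only, the real rule text is much longer.
        String baseRules = "::[[:Latin:][:Common:][:Inherited:]];\n::NFD();\n::NFC();\n";
        String germanRules = "\\u00e4 > ae;\n";
        System.out.println(insertBefore(baseRules, "::NFD();", germanRules));
    }
}
```

In the constructor above, the replace call could then become RuleSplicer.insertBefore(latinASCIIRules, "::NFD();", german_DIN_5007_2Rules). Unlike replace, this touches only the first occurrence of the marker.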