How do I detect the script system/alphabet of UTF-8 input?

Time: 2014-11-20 18:46:50

Tags: java unicode utf-8 icu

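I am currently building a web interface for transliteration based on icu4j. What is the best way to automatically detect the script system of a query entered by the user?

E.g. if the input is 身体里 or عالمتاب, how can/should I identify which script system it comes from?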

1 answer:

Answer 0 (score: 2)

The simplest approach is to check the script of the first character:

static Character.UnicodeScript getScript(String s) {
    if (s.isEmpty()) {
        return null;
    }
    return Character.UnicodeScript.of(s.codePointAt(0));
}
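
As a quick check of the first-character approach, here is a minimal sketch that runs it on the two inputs from the question plus one input with a leading space. The class name `FirstCharScriptDemo` and the constant names are made up for illustration; the sample strings are written as Unicode escapes so the file compiles regardless of source encoding.

```java
class FirstCharScriptDemo {
    // Hypothetical sample inputs: the two strings from the question.
    static final String HAN_SAMPLE = "\u8eab\u4f53\u91cc";
    static final String ARABIC_SAMPLE = "\u0639\u0627\u0644\u0645\u062a\u0627\u0628";

    // Same logic as the snippet above: look only at the first code point.
    static Character.UnicodeScript getScript(String s) {
        if (s.isEmpty()) {
            return null;
        }
        return Character.UnicodeScript.of(s.codePointAt(0));
    }

    public static void main(String[] args) {
        System.out.println(getScript(HAN_SAMPLE));       // HAN
        System.out.println(getScript(ARABIC_SAMPLE));    // ARABIC
        // A leading space belongs to the COMMON script, so the first
        // character alone can be misleading.
        System.out.println(getScript(" " + HAN_SAMPLE)); // COMMON
    }
}
```

The last call shows the weak spot: whitespace, digits, and most punctuation are assigned to `COMMON`, so any input that starts with one of them is not attributed to a specific script.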

A better approach is to find the script that occurs most frequently:

static Character.UnicodeScript getScript(String s) {
    int[] counts = new int[Character.UnicodeScript.values().length];

    Character.UnicodeScript mostFrequentScript = null;
    int maxCount = 0;

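    // i is a char index (not a code point index), so the bound is s.length();
    // offsetByCodePoints steps over surrogate pairs as a single unit.
    for (int i = 0; i < s.length(); i = s.offsetByCodePoints(i, 1)) {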
        int codePoint = s.codePointAt(i);
        Character.UnicodeScript script = Character.UnicodeScript.of(codePoint);

        int count = ++counts[script.ordinal()];
        if (mostFrequentScript == null || count > maxCount) {
            maxCount = count;
            mostFrequentScript = script;
        }
    }

    return mostFrequentScript;
}
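
One possible refinement, not part of the answer above, is to skip code points whose script is `COMMON`, `INHERITED`, or `UNKNOWN`, so that spaces, digits, and punctuation cannot outvote the actual letters. The sketch below assumes Java 8 or later (for `Map.merge`); the class name `DominantScriptDemo` and the method name `getDominantScript` are made up for illustration.

```java
import java.util.EnumMap;
import java.util.Map;

class DominantScriptDemo {
    // Returns the most frequent script, ignoring code points whose script is
    // COMMON (spaces, digits, punctuation), INHERITED (combining marks) or
    // UNKNOWN. Returns null if no code point has a more specific script.
    static Character.UnicodeScript getDominantScript(String s) {
        Map<Character.UnicodeScript, Integer> counts =
                new EnumMap<>(Character.UnicodeScript.class);
        Character.UnicodeScript best = null;
        int bestCount = 0;

        for (int i = 0; i < s.length(); ) {
            int codePoint = s.codePointAt(i);
            i += Character.charCount(codePoint); // 1 or 2 chars per code point

            Character.UnicodeScript script = Character.UnicodeScript.of(codePoint);
            if (script == Character.UnicodeScript.COMMON
                    || script == Character.UnicodeScript.INHERITED
                    || script == Character.UnicodeScript.UNKNOWN) {
                continue;
            }

            int count = counts.merge(script, 1, Integer::sum);
            if (count > bestCount) {
                bestCount = count;
                best = script;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Digits, space and "!" are COMMON and are skipped.
        System.out.println(getDominantScript("123 \u8eab\u4f53\u91cc!")); // HAN
        // Nothing but COMMON characters: no specific script found.
        System.out.println(getDominantScript("1, 2, 3"));                 // null
    }
}
```

On a tie, the script that reaches the winning count first is returned. Whether that is acceptable depends on how mixed-script queries should be handled by the transliterator.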