How to detect Japanese text in a Java String?

Asked: 2009-09-30 18:16:38

Tags: java unicode character-encoding

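I need to be able to detect Japanese characters in a Java String.

Currently I get the UnicodeBlock and check whether it equals Character.UnicodeBlock.KATAKANA or Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS, but I'm not 100% sure that covers everything.

Any suggestions?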

2 answers:

Answer 0 (score: 8)

I use the following Java method. It may not do exactly what you need.

<!-- language: lang-java -->
/**
 * Returns whether a character is a Chinese-Japanese-Korean (CJK) character.
 *
 * @param c
 *            the character to be tested
 * @return true if CJK, false otherwise
 */
private boolean isCharCJK(final char c) {
    // Look the block up once instead of once per comparison.
    final Character.UnicodeBlock block = Character.UnicodeBlock.of(c);
    // Note: EXTENSION_B lies outside the Basic Multilingual Plane, so a single
    // char (one surrogate half) never matches it; that comparison only becomes
    // reachable if the method is changed to take an int code point.
    return block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
            || block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A
            || block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B
            || block == Character.UnicodeBlock.CJK_COMPATIBILITY_FORMS
            || block == Character.UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS
            || block == Character.UnicodeBlock.CJK_RADICALS_SUPPLEMENT
            || block == Character.UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION
            || block == Character.UnicodeBlock.ENCLOSED_CJK_LETTERS_AND_MONTHS;
}
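
The method above takes a char, so characters outside the Basic Multilingual Plane (for example those in CJK Unified Ideographs Extension B) only ever reach it as separate surrogate halves. Here is a minimal sketch of a code-point-based variant, assuming Java 8 or later for String.codePoints(). The class name CjkBlocks and its method names are made up for illustration and are not part of the original answer.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class CjkBlocks {
    // The same eight blocks as in the answer, collected in a set (hypothetical helper).
    private static final Set<Character.UnicodeBlock> CJK_BLOCKS = new HashSet<>(Arrays.asList(
            Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS,
            Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A,
            Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B,
            Character.UnicodeBlock.CJK_COMPATIBILITY_FORMS,
            Character.UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS,
            Character.UnicodeBlock.CJK_RADICALS_SUPPLEMENT,
            Character.UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION,
            Character.UnicodeBlock.ENCLOSED_CJK_LETTERS_AND_MONTHS));

    /** Returns true if the code point lies in one of the listed CJK blocks. */
    public static boolean isCjk(int codePoint) {
        // UnicodeBlock.of returns null for code points in no defined block;
        // HashSet.contains(null) is simply false in that case.
        return CJK_BLOCKS.contains(Character.UnicodeBlock.of(codePoint));
    }

    /** Returns true if any code point of the text lies in one of the listed blocks. */
    public static boolean containsCjk(String text) {
        // codePoints() joins surrogate pairs, so supplementary characters are seen whole.
        return text.codePoints().anyMatch(CjkBlocks::isCjk);
    }

    public static void main(String[] args) {
        String extensionB = new String(Character.toChars(0x20000)); // first Extension B ideograph
        System.out.println(containsCjk("漢字"));       // true
        System.out.println(containsCjk(extensionB));   // true
        System.out.println(containsCjk("kana: かな")); // false, hiragana is not in the listed blocks
    }
}
```

The last line of main shows why these blocks alone are not enough for Japanese: kana live in their own blocks and have to be checked separately.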

In addition, these should work for Hiragana and Katakana characters:

private boolean isHiragana(final char c) {
    return Character.UnicodeBlock.of(c) == Character.UnicodeBlock.HIRAGANA;
}

private boolean isKatakana(final char c) {
    return Character.UnicodeBlock.of(c) == Character.UnicodeBlock.KATAKANA;
}
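
To apply per-character checks like these to a whole string, one option is to walk the string by code point and test the Unicode script instead of the block. The sketch below assumes Java 7 or later for Character.UnicodeScript and Java 8 or later for codePoints(). JapaneseScripts and its method names are illustrative and do not come from the original answer. Since Han characters are shared with Chinese, it treats the presence of kana as the stronger hint that the text is Japanese.

```java
final class JapaneseScripts {
    /** Returns true for code points whose Unicode script is Hiragana or Katakana. */
    public static boolean isKana(int codePoint) {
        Character.UnicodeScript script = Character.UnicodeScript.of(codePoint);
        return script == Character.UnicodeScript.HIRAGANA
                || script == Character.UnicodeScript.KATAKANA;
    }

    /** Returns true for code points whose Unicode script is Han (kanji, hanzi, hanja). */
    public static boolean isHan(int codePoint) {
        return Character.UnicodeScript.of(codePoint) == Character.UnicodeScript.HAN;
    }

    /** Returns true if the text contains at least one kana character. */
    public static boolean containsKana(String text) {
        return text.codePoints().anyMatch(JapaneseScripts::isKana);
    }

    /** Returns true if the text contains at least one kana or Han character. */
    public static boolean containsKanaOrHan(String text) {
        return text.codePoints().anyMatch(cp -> isKana(cp) || isHan(cp));
    }

    public static void main(String[] args) {
        System.out.println(containsKana("こんにちは"));   // true
        System.out.println(containsKana("漢字"));         // false, Han only
        System.out.println(containsKanaOrHan("漢字"));    // true
        System.out.println(containsKanaOrHan("hello"));   // false
    }
}
```

One side effect of testing the script: half-width katakana letters such as U+FF71 are assigned to the Katakana script, so they are caught without accepting the whole HALFWIDTH_AND_FULLWIDTH_FORMS block, which also contains full-width Latin letters.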

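Answer 1 (score: 5)

According to regular-expressions.info, Japanese is not written in a single script: "There is no Japanese Unicode script. Instead, Unicode offers the Hiragana, Katakana, Han and Latin scripts that Japanese documents are usually composed of."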

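In that case, this regex should do the trick. Note that Java's regex engine expects the Is prefix on script names (supported since Java 7); the bare \p{Hiragana} form is rejected with a PatternSyntaxException:

yourString.matches("[\\p{IsHiragana}\\p{IsKatakana}\\p{IsHan}\\p{IsLatin}]*+")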
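
A whole-string matches() call answers "does the text consist only of these scripts?", which is a different question from "does the text contain any Japanese characters?". Here is a rough sketch of both, assuming Java 7 or later, where script names in a regex take the Is prefix. The class JapaneseRegex and its method names are hypothetical.

```java
import java.util.regex.Pattern;

final class JapaneseRegex {
    // One kana or Han character anywhere in the input.
    private static final Pattern KANA_OR_HAN =
            Pattern.compile("[\\p{IsHiragana}\\p{IsKatakana}\\p{IsHan}]");

    // The entire input drawn from the four scripts named in the quote.
    private static final Pattern ONLY_LISTED_SCRIPTS =
            Pattern.compile("[\\p{IsHiragana}\\p{IsKatakana}\\p{IsHan}\\p{IsLatin}]*+");

    /** Returns true if at least one kana or Han character occurs in the text. */
    public static boolean containsKanaOrHan(String text) {
        return KANA_OR_HAN.matcher(text).find();
    }

    /** Returns true if every character belongs to one of the four listed scripts. */
    public static boolean usesOnlyListedScripts(String text) {
        return ONLY_LISTED_SCRIPTS.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(containsKanaOrHan("abc カタカナ"));  // true
        System.out.println(containsKanaOrHan("abc"));           // false
        System.out.println(usesOnlyListedScripts("abc"));       // true, Latin alone passes
        System.out.println(usesOnlyListedScripts("abc 123"));   // false, space and digits are Common
    }
}
```

Two things to keep in mind with the whole-string form: it accepts plain Latin text and the empty string, and it rejects anything containing spaces, digits, or punctuation, because those belong to the Common script. For detection, find() with the kana and Han classes is probably closer to what the question asks for.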