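Is there a Python library that gives the script name for a given Unicode character or string?

Time: 2011-02-09 11:43:04

Tags: python unicode

Is there a library that can tell which script a particular Unicode character belongs to?

For example, for the input u'ሕ' it should return Ethiopic or something similar.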

2 answers:

Answer 0 (score: 7)

Perhaps the data in the unicodedata module is what you are looking for:

import unicodedata
print(unicodedata.name(u"ሕ"))

prints

ETHIOPIC SYLLABLE HHE

The printed name can in turn be used to look up the corresponding character:

unicodedata.lookup("ETHIOPIC SYLLABLE HHE")
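Note that unicodedata.name returns the character name, not the Unicode Script property. As a rough sketch only, the first word of the name can serve as a guess at the script. The helper guess_script_from_name below is a hypothetical name introduced for illustration, and the heuristic is an assumption that holds only for characters whose name happens to begin with the script name.

```python
import unicodedata


def guess_script_from_name(char):
    """Heuristic: return the first word of the character's Unicode name.

    This is NOT the Unicode Script property. It only gives a sensible
    answer when the character name starts with the script name.
    """
    # The second argument is returned instead of raising ValueError
    # when the character has no name (for example a noncharacter).
    name = unicodedata.name(char, "")
    return name.split(" ", 1)[0] if name else "Unknown"


for ch in (u"A", u"Д", u"ሕ"):
    print(guess_script_from_name(ch))
```

The heuristic breaks down quickly: for u"1" the name is DIGIT ONE, so the guess would be DIGIT rather than a script. For reliable results the Script property data itself is needed.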

Answer 1 (score: 2)

You can parse the Scripts.txt file:

# -*- coding: utf-8; -*-

import bisect

script_file = "/path/to/Scripts.txt"
scripts = []

with open(script_file, "rt") as stream:
    for line in stream:
        line = line.split("#", 1)[0].strip()
        if line:
            rng, script = line.split(";", 1)
            elems = rng.split("..", 1)
            start = int(elems[0], 16)
            if len(elems) == 2:
                stop = int(elems[1], 16)
            else:
                stop = start
            scripts.append((start, stop, script.lstrip()))

scripts.sort()
indices = [elem[0] for elem in scripts]

def find_script(char):
    if not isinstance(char, int):
        char = ord(char)
    index = bisect.bisect(indices, char) - 1
    start, stop, script = scripts[index]
    if start <= char <= stop:
        return script
    else:
        return "Unknown"

print(find_script(u'A'))
print(find_script(u'Д'))
print(find_script(u'ሕ'))
print(find_script(0x1000))
print(find_script(0xE007F))
print(find_script(0xE0080))

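Note that this code is neither robust nor optimized. You should check that the argument represents a valid character or code point, and you should merge adjacent ranges that map to the same script.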
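The two suggested improvements could look roughly like the sketch below. It is one possible approach, not part of the original answer: merge_ranges and to_code_point are hypothetical helper names, find_script is rewritten to take the tables as arguments, and the sample rows are made up for illustration rather than copied from Scripts.txt.

```python
import bisect


def merge_ranges(ranges):
    """Merge adjacent or overlapping ranges that carry the same script name.

    `ranges` is an iterable of (start, stop, script) tuples.
    """
    merged = []
    for start, stop, script in sorted(ranges):
        if merged and merged[-1][2] == script and start <= merged[-1][1] + 1:
            prev_start, prev_stop, _ = merged[-1]
            merged[-1] = (prev_start, max(prev_stop, stop), script)
        else:
            merged.append((start, stop, script))
    return merged


def to_code_point(char):
    """Return a valid code point for an int or a one-character string."""
    if isinstance(char, int):
        code_point = char
    else:
        code_point = ord(char)  # raises TypeError unless len(char) == 1
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point: %r" % (char,))
    return code_point


def find_script(char, ranges, starts):
    """Look up the script for `char` in sorted, non-overlapping `ranges`."""
    code_point = to_code_point(char)
    index = bisect.bisect(starts, code_point) - 1
    if index >= 0:  # guard against wrapping around to the last range
        start, stop, script = ranges[index]
        if start <= code_point <= stop:
            return script
    return "Unknown"


# Made-up sample rows in the shape produced by the parsing loop.
sample = [
    (0x1000, 0x102A, "Myanmar"),
    (0x102B, 0x102C, "Myanmar"),
    (0x102D, 0x1030, "Myanmar"),
    (0x1200, 0x1248, "Ethiopic"),
]

ranges = merge_ranges(sample)
starts = [r[0] for r in ranges]

print(len(sample), "ranges merged into", len(ranges))
print(find_script(0x102B, ranges, starts))
print(find_script(0x1100, ranges, starts))
```

Merging shrinks the table that bisect has to search, and the validation step turns an out-of-range integer or a multi-character string into an explicit error instead of a silent "Unknown".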