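Is there a library that can tell which script a particular Unicode character belongs to?
For example, for the input u'ሕ' it should return Ethiopic or something similar.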
Answer 0 (score: 7)
Perhaps the data in the unicodedata module is what you are looking for:
import unicodedata
print(unicodedata.name(u"ሕ"))
This prints:
ETHIOPIC SYLLABLE HHE
The printed name can in turn be used to look up the corresponding character:
unicodedata.lookup("ETHIOPIC SYLLABLE HHE")
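Because many character names begin with the name of the script, a rough guess can be read off the first word of the name. The following is only a heuristic sketch and not the Unicode Script property: `guess_script_from_name` is a hypothetical helper name, and characters such as digits or punctuation have names that do not start with a script name at all.

```python
import unicodedata

def guess_script_from_name(char):
    """Return the first word of the character's Unicode name, or None."""
    # unicodedata.name() returns the given default (None here)
    # for characters that have no name.
    name = unicodedata.name(char, None)
    if name is None:
        return None
    return name.split()[0]

for ch in (u"A", u"Д", u"ሕ", u"1"):
    print(guess_script_from_name(ch))
```

For the last character the result is DIGIT, which shows where the heuristic breaks down; parsing Scripts.txt avoids this problem.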
Answer 1 (score: 2)
You can parse the Scripts.txt file from the Unicode Character Database:
# -*- coding: utf-8; -*-
import bisect
import io

script_file = "/path/to/Scripts.txt"

# Parse every data line into a (start, stop, script) tuple.
scripts = []
# Scripts.txt is UTF-8 encoded; io.open accepts an encoding on Python 2 and 3.
with io.open(script_file, "rt", encoding="utf-8") as stream:
    for line in stream:
        # Drop the trailing comment and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if line:
            rng, script = line.split(";", 1)
            elems = rng.split("..", 1)
            start = int(elems[0], 16)
            if len(elems) == 2:
                stop = int(elems[1], 16)
            else:
                stop = start
            scripts.append((start, stop, script.lstrip()))
scripts.sort()
indices = [elem[0] for elem in scripts]

def find_script(char):
    # Accept either a single character or an integer code point.
    if not isinstance(char, int):
        char = ord(char)
    # Locate the last range whose start is <= char.
    index = bisect.bisect(indices, char) - 1
    start, stop, script = scripts[index]
    if start <= char <= stop:
        return script
    else:
        return "Unknown"

print(find_script(u'A'))
print(find_script(u'Д'))
print(find_script(u'ሕ'))
print(find_script(0x1000))
print(find_script(0xE007F))
print(find_script(0xE0080))
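Note that this code is neither robust nor optimized. You should check that the argument represents a valid character or code point, and you should merge adjacent ranges that belong to the same script.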
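The two improvements mentioned above could look roughly like the sketch below. It assumes Python 3 and uses a small hand-written sample of ranges in place of the parsed file, so that it runs on its own. The names `merge_ranges`, `to_code_point`, and `find_script_checked` are hypothetical and not part of the original answer.

```python
import bisect

# Illustrative sample of (start, stop, script) tuples in the shape the
# parser produces; real data would come from Scripts.txt.
sample = [
    (0x0000, 0x001F, "Common"),
    (0x0020, 0x0020, "Common"),
    (0x0021, 0x0023, "Common"),
    (0x0041, 0x005A, "Latin"),
    (0x0061, 0x007A, "Latin"),
]

def merge_ranges(ranges):
    """Merge ranges that touch each other and share the same script."""
    merged = []
    for start, stop, script in sorted(ranges):
        if merged and merged[-1][2] == script and merged[-1][1] + 1 == start:
            # Extend the previous range instead of adding a new one.
            merged[-1] = (merged[-1][0], stop, script)
        else:
            merged.append((start, stop, script))
    return merged

def to_code_point(char):
    """Validate the argument and return it as an integer code point."""
    if isinstance(char, int):
        code = char
    elif isinstance(char, str) and len(char) == 1:
        code = ord(char)
    else:
        raise TypeError("expected a single character or an integer")
    if not 0 <= code <= 0x10FFFF:
        raise ValueError("code point out of range: %r" % (code,))
    return code

merged = merge_ranges(sample)
starts = [r[0] for r in merged]

def find_script_checked(char):
    code = to_code_point(char)
    index = bisect.bisect_right(starts, code) - 1
    if index >= 0:
        start, stop, script = merged[index]
        if start <= code <= stop:
            return script
    return "Unknown"

print(len(sample), "ranges merged into", len(merged))
print(find_script_checked("A"))
```

Merging reduces the number of entries the binary search has to work with, and the validation step turns bad input into an explicit exception instead of a misleading "Unknown".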