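I have two Unicode characters with the same meaning. The compat character is a reference to the origin character, so I expected the two to have the same value, but when I compare their decompositions for equality the result is False.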
import unicodedata

origin = 'ᅢ' # U+1162 HANGUL JUNGSEONG AE (Korean letter AE)
compat = 'ㅐ' # U+3150 HANGUL LETTER AE (compatibility form of AE)
print('origin', ascii(origin))
print('compat', ascii(compat), '\n')
decompose_origin = unicodedata.decomposition(origin)
decompose_compat = unicodedata.decomposition(compat)
print('decompose: origin', decompose_origin)
print('decompose: compat', decompose_compat, '\n')
# expected output: True
print(decompose_origin == decompose_compat)
origin '\u1162'
compat '\u3150'
decompose: origin
decompose: compat <compat> 1162
False
Answer 0 (score: 2)
Normalize the strings to the NFKC or NFKD normal form to make them comparable:
from unicodedata import normalize
origin = '\u1162'
compat = '\u3150'
for normal_form in ('NFC', 'NFD', 'NFKC', 'NFKD'):
    print(normal_form, ascii(normalize(normal_form, origin + ' == ' + compat)))
    print(normalize(normal_form, origin) == normalize(normal_form, compat))
# NFC '\u1162 == \u3150'
# False
# NFD '\u1162 == \u3150'
# False
# NFKC '\u1162 == \u1162'
# True
# NFKD '\u1162 == \u1162'
# True
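If the comparison is needed in more than one place, it could be wrapped in a small helper. This is only a minimal sketch: the name compat_equal and its form parameter are hypothetical, not part of unicodedata, and NFKC is assumed as the default form.

```python
from unicodedata import normalize


def compat_equal(a: str, b: str, form: str = 'NFKC') -> bool:
    """Return True if a and b are equal after normalizing both to `form`.

    Hypothetical helper for illustration; only `normalize` comes from the
    standard library.
    """
    return normalize(form, a) == normalize(form, b)


origin = '\u1162'  # HANGUL JUNGSEONG AE
compat = '\u3150'  # HANGUL LETTER AE

print(compat_equal(origin, compat))              # True
print(compat_equal(origin, compat, form='NFC'))  # False
```

Passing a canonical-only form such as NFC reproduces the False result from the question, because canonical normalization leaves compatibility characters untouched.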
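NFKC and NFKD both apply "the compatibility decomposition, i.e. replace all compatibility characters with their equivalents". The NFKC normal form additionally applies the canonical composition afterwards.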
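The difference between the two forms only shows up on text that can be composed. A rough illustration, assuming the precomposed Hangul syllable U+C560 (애) as a sample input chosen for this sketch: NFKD leaves it as separate jamo, while NFKC recomposes them into one code point.

```python
from unicodedata import normalize

syllable = '\uc560'  # sample input: precomposed Hangul syllable AE (애)

nfkd = normalize('NFKD', syllable)  # decomposed into leading consonant + vowel
nfkc = normalize('NFKC', syllable)  # decomposed, then recomposed

print('NFKD', ascii(nfkd), len(nfkd))  # NFKD '\u110b\u1162' 2
print('NFKC', ascii(nfkc), len(nfkc))  # NFKC '\uc560' 1

# The two results differ as raw strings but agree once both use the same form.
print(nfkd == nfkc)                     # False
print(normalize('NFKC', nfkd) == nfkc)  # True
```

Either form works for equality checks as long as both sides use the same one; NFKC tends to give shorter strings because of the final composition step.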