Different encodings when using Unicode characters in Python

Time: 2018-08-11 17:04:16

Tags: python python-unicode unicode-normalization

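I'm running into a problem in Python when a string uses a decomposed Unicode sequence (base letter plus combining mark) instead of the precomposed character. Here is code that reproduces it: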

# encoding=utf8

a = ["Địa"]
b = ["Địa"]

print(a)  # ['\xc4\x90i\xcc\xa3a']
print(b)  # ['\xc4\x90\xe1\xbb\x8ba']

print("Địa" in a)  # False
print("Địa" in b)  # True

How can I convert/normalize them to the same representation?

1 answer:

Answer 0 (score: 1)

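You can use unicodedata.normalize(). The 'NFC' form combines a base letter and its combining marks into the single precomposed character wherever one exists, so both spellings compare equal after normalization: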

# encoding=utf8
import unicodedata
a = ["Địa"]
b = ["Địa"]

print("Địa" in [unicodedata.normalize('NFC', i) for i in a])
print("Địa" in [unicodedata.normalize('NFC', i) for i in b])

This outputs:

True
True
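
To see why the two strings differ in the first place, and to make the comparison hold whichever form the lookup string is typed in, here is a minimal sketch. It assumes Python 3, where string literals are Unicode text; the byte-escaped output in the question suggests Python 2, where the literals would first have to be decoded to unicode objects. The names decomposed, precomposed and nfc are only for illustration, and the strings are written with escapes so an editor cannot silently change their form.

```python
import unicodedata

# The same visible word, spelled two ways.
decomposed = "\u0110i\u0323a"   # Đ, i, U+0323 COMBINING DOT BELOW, a
precomposed = "\u0110\u1ecba"   # Đ, U+1ECB (i with dot below), a


def nfc(text):
    """Return text in Unicode Normalization Form C (composed)."""
    return unicodedata.normalize("NFC", text)


# Different code point sequences, so plain comparison fails.
print(decomposed == precomposed)
print(len(decomposed), len(precomposed))

# Normalizing both sides makes the comparison independent of the input form.
print(nfc(decomposed) == nfc(precomposed))
print(nfc(precomposed) in [nfc(s) for s in [decomposed]])
```

Normalizing the string being searched for, as well as the list items, avoids relying on the source file or keyboard input already being in NFC.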