Question

My code:

a = '汉'
b = u'汉'

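These are the same Chinese character, but obviously a == b is False. How do I fix this? Note that I can't convert a to UTF-8, because I don't have access to that code. I need to convert b to the encoding that a is using.

So my question is: how do I convert b to the encoding of a?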

Answer 1

If you don't know a's encoding, you'll need to detect a's encoding and then encode b using the detected encoding.

First, to detect a's encoding, let's use chardet.

$ pip install chardet

Now let's use it:

>>> import chardet
>>> a = '汉'
>>> chardet.detect(a)
{'confidence': 0.505, 'encoding': 'utf-8'}

So, to actually do what you asked:

>>> encoding = chardet.detect(a)['encoding']
>>> b = u'汉'
>>> b_encoded = b.encode(encoding)
>>> a == b_encoded
True
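
The detection above reports a confidence of only about 0.5, since a single character gives a detector very little to work with. Because b is known to hold the same text as a, one alternative is to try a few candidate encodings and keep the first one whose encoded form matches a byte for byte. The sketch below is an illustration rather than part of the original answer: the helper name find_encoding and the candidate list are hypothetical, and it is written for Python 3, where a would be a bytes object and b a str.

```python
def find_encoding(encoded, text, candidates=("utf-8", "gbk", "big5", "utf-16")):
    """Return the first candidate codec under which text encodes to encoded."""
    for name in candidates:
        try:
            if text.encode(name) == encoded:
                return name
        except UnicodeEncodeError:
            # text cannot be represented in this codec; try the next one
            continue
    return None


a = "汉".encode("gbk")   # stand-in for bytes produced by code you cannot change
b = "汉"

encoding = find_encoding(a, b)
b_encoded = b.encode(encoding) if encoding else None
print(encoding, a == b_encoded)
```

The first match is not necessarily the only one, since related codecs can produce identical bytes for the same character, but any match is enough to make the comparison succeed.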

Answer 2

Use str.decode to decode the encoded string a:

>>> a = '汉'
>>> b = u'汉'
>>> a.decode('utf-8') == b
True

Note: replace utf-8 with whatever encoding your source file actually uses.
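
To illustrate why the codec name matters, here is a small sketch in Python 3 syntax (an assumption; the answer above is Python 2), with the byte strings spelled out explicitly. The variable names a_utf8 and a_gbk are hypothetical. Decoding with the matching codec recovers the character, while decoding GBK bytes as UTF-8 fails.

```python
b = "汉"

a_utf8 = b"\xe6\xb1\x89"   # UTF-8 bytes for the character
a_gbk = b"\xba\xba"        # GBK bytes for the same character

# The matching codec recovers the original text in both cases.
print(a_utf8.decode("utf-8") == b)
print(a_gbk.decode("gbk") == b)

try:
    a_gbk.decode("utf-8")
    wrong_codec_failed = False
except UnicodeDecodeError:
    # 0xBA is not a valid leading byte in UTF-8
    wrong_codec_failed = True
print(wrong_codec_failed)
```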

Answer 3

Both a.decode and b.encode work:

In [133]: a.decode('utf') == b
Out[133]: True

In [134]: b.encode('utf') == a
Out[134]: True

Note that str.encode and unicode.decode also exist; don't mix them up. See What is the difference between encode/decode?
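
For readers on Python 3, where the two types are bytes and str, the same two directions can be sketched as follows. This is an illustration assuming Python 3 and UTF-8 data, not part of the original answer. There the confusable methods no longer exist at all, which the hasattr checks at the end show.

```python
a = "汉".encode("utf-8")   # bytes, playing the role of Python 2's str
b = "汉"                   # str, playing the role of Python 2's unicode

# Either direction brings both values to the same type before comparing.
print(a.decode("utf-8") == b)
print(b.encode("utf-8") == a)

# Comparing across types without converting is simply False.
print(a == b)

# Python 3 drops the misleading methods: bytes only decodes, str only encodes.
print(hasattr(a, "encode"), hasattr(b, "decode"))
```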