Question

I am using Python 2.7.11.

I have two tuples:

>>> t1 = (u'aaa', u'bbb')
>>> t2 = ('aaa', 'bbb')

I tried this:

>>> t1==t2
True

How does Python treat unicode versus non-unicode strings here?

Answer 1

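Python 2 treats a bytestring and a unicode string as equal whenever the bytestring can be decoded to the same text. By the way, this has nothing to do with the containing tuple. Instead, it has to do with an implicit type conversion, which I will explain below.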

It's difficult to demonstrate with "simple" ascii code points, so to see what is really going on, we can provoke a failure by using higher code points:

>>> bites = u'Ç'.encode('utf-8')
>>> unikode = u'Ç'
>>> print bites
Ç
>>> print unikode
Ç
>>> bites == unikode
/Users/wim/Library/Python/2.7/bin/ipython:1: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  #!/usr/bin/python
False

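Upon seeing the unicode and bytes comparison above, Python implicitly attempted to decode the bytestring into a unicode object, on the assumption that the bytes were encoded with sys.getdefaultencoding() (which is 'ascii' on my platform).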

In the case shown above, this failed, because the bytes were encoded in 'utf-8'. Now, let's make it "work":

>>> bites = u'Ç'.encode('ISO8859-1')
>>> unikode = u'Ç'
>>> import sys
>>> reload(sys)   # please don't ever actually use this hack, guys 
<module 'sys' (built-in)>
>>> sys.setdefaultencoding('ISO8859-1')
>>> bites == unikode
True

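Your upconversion "works" the same way, but using the 'ascii' codec. These implicit conversions between bytes and unicode are actually pretty evil and can cause a lot of pain, so it was decided to stop doing them in Python 3, because "explicit is better than implicit".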
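To make the implicit step visible, here is a rough sketch that spells it out by hand. It is written for Python 3 (where nothing is decoded implicitly), and the helper name py2_style_equal and its default_encoding parameter are made up for illustration; this only approximates what Python 2's == did, it is not the interpreter's actual code.

```python
# Hypothetical helper, Python 3 syntax: approximates Python 2's implicit
# bytes/unicode comparison by doing the decode step explicitly.
def py2_style_equal(raw, text, default_encoding="ascii"):
    """Compare bytes with str roughly the way Python 2's == did."""
    try:
        # Python 2 decoded with sys.getdefaultencoding() behind the scenes.
        decoded = raw.decode(default_encoding)
    except UnicodeDecodeError:
        # Python 2 emitted a UnicodeWarning here and treated them as unequal.
        return False
    return decoded == text


# Pure ascii bytes decode cleanly, so the comparison succeeds.
ascii_match = py2_style_equal(b"aaa", "aaa")

# utf-8 bytes for a non-ascii character cannot be decoded as ascii.
utf8_mismatch = py2_style_equal("Ç".encode("utf-8"), "Ç")

# With a matching default encoding ("latin-1" is another name for ISO8859-1)
# the decode succeeds and the values compare equal.
latin1_match = py2_style_equal(
    "Ç".encode("latin-1"), "Ç", default_encoding="latin-1"
)

print(ascii_match, utf8_mismatch, latin1_match)
```

The three results mirror the three situations in this answer: the question's ascii tuples, the failing utf-8 comparison, and the setdefaultencoding hack.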

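As a minor digression, on Python 3+ both of your tuples actually contain unicode string literals, so they are equal anyway. The u prefix is silently ignored. If you want a bytestring literal in Python 3, you need to specify it like b'this'. Then you would want to either 1) explicitly decode the bytes, or 2) explicitly encode the unicode object before making the comparison.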
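A small Python 3 sketch of those two options, applied to tuples like the ones in the question. The variable names are illustrative only, and 'ascii' is assumed as the encoding because the sample data is plain ascii.

```python
# Python 3: one tuple of bytestrings, one tuple of (unicode) strings.
t_bytes = (b"aaa", b"bbb")
t_text = ("aaa", "bbb")

# No implicit decoding happens, so bytes and str never compare equal.
implicit_equal = t_bytes == t_text

# Option 1: explicitly decode the bytes before comparing.
decoded_equal = tuple(b.decode("ascii") for b in t_bytes) == t_text

# Option 2: explicitly encode the text before comparing.
encoded_equal = t_bytes == tuple(s.encode("ascii") for s in t_text)

print(implicit_equal, decoded_equal, encoded_equal)
```

Either option works; which one to pick depends on whether the rest of the program wants to deal in text or in raw bytes.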
