Question

I'm confused about how Python handles unicode / str. I ran into the following in Python 2.

These statements were written in a UTF-8 encoded .py file in the PyCharm IDE.

print "hello! %s" % u"中国"
print "hello! %s" % "中国"
print u"hello! %s" % "中国"

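Only case 3 raises a decode error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

Can someone explain how Python processes these statements, and why the results differ?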

Answer 1

If you drop the print statement, you can see more detail:

>>> "hello! %s" % u"中国"
u'hello! \u4e2d\u56fd'
>>> "hello! %s" % "中国"
'hello! \xe4\xb8\xad\xe5\x9b\xbd'
>>> u"hello! %s" % "中国"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

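This shows what is happening. Whenever a unicode string is involved, Python tries to convert the other operand to unicode as well; and, as usual, absent any indication to the contrary, it assumes the encoding is ASCII.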
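The implicit step described above can be mimicked with explicit calls. This is only a rough sketch: `implicit_coerce` is a made-up helper name, not something Python exposes, and it stands in for the conversion Python 2 performs behind the scenes. It uses byte and unicode escapes so that it runs on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
UTF8_BYTES = b"\xe4\xb8\xad\xe5\x9b\xbd"  # the UTF-8 bytes of the two characters


def implicit_coerce(value):
    """Hypothetical stand-in for Python 2's implicit str -> unicode step."""
    if isinstance(value, bytes):       # `bytes` is an alias of `str` on Python 2
        return value.decode("ascii")   # ASCII is the assumed default codec
    return value


# Case 1: byte-string template, unicode argument.
# The template is pure ASCII, so decoding it succeeds.
case1 = implicit_coerce(b"hello! %s") % u"\u4e2d\u56fd"

# Case 3: unicode template, byte-string argument.
# The argument starts with byte 0xe4, which is not ASCII, so decoding fails.
error = None
try:
    case3 = u"hello! %s" % implicit_coerce(UTF8_BYTES)
except UnicodeDecodeError as exc:
    case3 = None
    error = exc
```

After this runs, `case1` holds the unicode result and `error` holds the same kind of exception reported in the question: codec `ascii`, failing at position 0.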

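In the first case, it converts the byte string "hello! %s" to unicode; since that string contains no non-ASCII characters, the conversion works, and the existing unicode string can then be interpolated safely into the result.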

In the second case, both sides are byte strings, so no conversion is attempted; the result is still a byte string.
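To see why the second result shows up as `\xe4\xb8\xad\xe5\x9b\xbd`, it may help to look at the bytes directly. The sketch below assumes, as in the question, that the source file is saved as UTF-8; the variable names are illustrative only.

```python
# -*- coding: utf-8 -*-
text = u"\u4e2d\u56fd"       # the two characters, as a unicode string
raw = text.encode("utf-8")   # what a UTF-8 source file stores for that literal

same_bytes = (raw == b"\xe4\xb8\xad\xe5\x9b\xbd")
lengths = (len(text), len(raw))   # characters vs. bytes

# Byte string % byte string: nothing is decoded, the bytes are copied through.
# (bytes %-formatting needs Python 2 or Python 3.5+.)
result = b"hello! %s" % raw
```

No codec is involved at any point here, which is why this case cannot raise a decode error.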

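In the third case, the "hello! %s" template is already unicode, so Python tries to convert the other side; but since that side contains non-ASCII bytes, the conversion fails. Specifying the encoding explicitly does work, however: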

>>> u"hello! %s" % "中国".decode('utf-8')
u'hello! \u4e2d\u56fd'
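The explicit decode can also be wrapped in a small helper so that callers do not need to know which type they were handed. This is a sketch under assumptions: `to_text` is a hypothetical name, and UTF-8 is assumed only because the source file in the question is UTF-8 encoded.

```python
# -*- coding: utf-8 -*-

def to_text(value, encoding="utf-8"):
    """Hypothetical helper: return text, decoding byte strings explicitly."""
    if isinstance(value, bytes):   # `bytes` is an alias of `str` on Python 2
        return value.decode(encoding)
    return value                   # already unicode, pass through unchanged


raw = b"\xe4\xb8\xad\xe5\x9b\xbd"  # UTF-8 bytes, as a UTF-8 source file stores them
greeting = u"hello! %s" % to_text(raw)

# One possible convention: encode back to bytes only when the data leaves the program.
encoded = greeting.encode("utf-8")
```

Because the codec is named explicitly, the ASCII default never comes into play.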
