Question

So I type the following in the Python terminal:

>>> s = "γειά"       ## it just means 'hi' in Greek
>>> s
'\x9a\x9c\xa0\xe1'   ## What is this? - Is it utf-encoding? Is it ascii escaped?
>>> print s
γειά

Now the interesting part:

>>> a = u"γειά"
>>> a
u'\u03b3\u03b5\u03b9\u03ac'    # Again what is this? utf-8 encoded? If so, how?
>>> print a
γειά

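I am confused about encodings, especially UTF-8-encoded strings versus ASCII-encoded strings. What is the difference between the two snippets above? And how do they relate to the unicode function?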

>>> result = unicode(s)
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x9a in position 0: ordinal not in range(128)

>>> result = unicode(s, 'utf-8')
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode byte 0x9a in position 0: invalid start byte

Can someone explain to me what is happening here? Thanks in advance.

Answer 1

In your first attempt, you are seeing an encoded version of the string (a sequence of raw bytes), but the encoding is not UTF-8:

>>> s='\x9a\x9c\xa0\xe1'
>>> s.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x9a in position 0: invalid start byte

It is encoded with whatever encoding your shell is using.
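The thread never says which encoding the shell was using, but the four bytes are consistent with cp737, the Greek DOS code page. Treat that as my assumption, not a confirmed fact. A minimal sketch under that assumption, written so it runs on both Python 2.7 and Python 3:

```python
# Assumption: the console code page was cp737 (Greek DOS); the thread never says so.
raw = b'\x9a\x9c\xa0\xe1'              # the bytes the REPL echoed for s

text = raw.decode('cp737')             # bytes -> unicode code points
utf8_bytes = text.encode('utf-8')      # the same text encoded as UTF-8

print(repr(text))
print(repr(utf8_bytes))
print(raw == utf8_bytes)               # False: s never held UTF-8 bytes
```

The UTF-8 form takes two bytes per Greek letter, so it cannot match the four bytes stored in s.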

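In your second example, you are creating a unicode string. Python knows your shell's encoding, so it can decode the input and store it as unicode code points (\u03b3\u03b5\u03b9\u03ac). Later, when you print it, Python again uses your shell's encoding to turn the unicode string back into actual bytes.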
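To make that round trip concrete, here is a rough sketch of what happens on print. The console_encoding name is hypothetical, and its value is again my cp737 assumption; a real session would use whatever the terminal reports.

```python
text = u'\u03b3\u03b5\u03b9\u03ac'     # what u"..." stores: four code points

# Hypothetical console encoding, chosen to match the bytes in the question.
console_encoding = 'cp737'

code_points = ['U+%04X' % ord(ch) for ch in text]
console_bytes = text.encode(console_encoding)   # roughly what print sends to the terminal

print(code_points)
print(repr(console_bytes))
```

Under that assumption, the encoded bytes are exactly the ones the first snippet showed for s.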

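As for your third example, you are calling the unicode function explicitly. When it is called without an encoding argument, it uses ascii as the default. ASCII cannot represent Greek characters, so Python complains. Passing 'utf-8' fails as well, because the bytes in s are in your shell's encoding, not UTF-8.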
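In Python 2, unicode(s, enc) does the same job as s.decode(enc), so the three outcomes can be compared side by side. This is only a sketch: try_decode is a helper I made up for illustration, and cp737 is still my guess at the console encoding.

```python
raw = b'\x9a\x9c\xa0\xe1'

def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

# ascii and utf-8 both reject byte 0x9a; the assumed console encoding accepts it.
results = dict((enc, try_decode(raw, enc)) for enc in ('ascii', 'utf-8', 'cp737'))

for enc in ('ascii', 'utf-8', 'cp737'):
    print(enc, repr(results[enc]))
```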

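To sum up, you need to know which encoding your console is using in order to understand what Python is doing with your code. On Windows, you can find out with the chcp command. On Linux, you can use the locale command.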
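You can also ask Python itself. A small sketch using the standard sys and locale modules; the two values may differ from each other and from what chcp or locale print, depending on how the interpreter was started.

```python
import locale
import sys

# Encoding Python uses when writing text to the terminal.
# May be None, for example when output is piped on Python 2.
stdout_encoding = getattr(sys.stdout, 'encoding', None)

# Encoding the current locale prefers for text data.
preferred = locale.getpreferredencoding()

print(stdout_encoding)
print(preferred)
```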

Of course, I forgot the most important piece of advice ever :P. As @thg435 suggested, this is a must-read: Unicode by Joel

It is also worth mentioning that a lot of this changes in Python 3.
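As a quick illustration of the Python 3 side (this one runs on Python 3 only): str is always unicode, bytes is a separate type, and the two are never mixed through an implicit ASCII conversion.

```python
# Python 3 only.
text = '\u03b3\u03b5\u03b9\u03ac'      # str holds code points; no u prefix needed
data = text.encode('utf-8')            # bytes is a distinct type

try:
    data + text                        # mixing bytes and str is an error
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(type(text).__name__, type(data).__name__, mixed_ok)
```

So the confusing implicit 'ascii' decode from the question goes away: you get a TypeError at the point where the types are mixed, and you always call encode or decode yourself.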
