Question

I tried print a_str.decode("utf-8"), print uni_str, print uni_str.decode("utf-8"), and print uni_str.encode("utf-8").

Only the first one works.

 >>> print '\xe8\xb7\xb3'.decode("utf-8")
 跳
 >>> print u'\xe8\xb7\xb3\xe8'
 è·³è
 >>> print u'\xe8\xb7\xb3\xe8'.decode("utf-8")
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
     return codecs.utf_8_decode(input, errors, True)
 UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
 >>> print u'\xe8\xb7\xb3\xe8'.encode("utf-8")
 è·³è

I'm confused about how to display a Unicode string correctly. If I have a string like a=u'\xe8\xb7\xb3\xe8', how can I print a?

Answer 1

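'\xe8\xb7\xb3' is a Chinese character encoded in UTF-8, so '\xe8\xb7\xb3'.decode('utf-8') works fine and returns the unicode value of 跳, u'\u8df3'. But u'\xe8\xb7\xb3' is a literal unicode string, which is not the same as the unicode for 跳. And a unicode string can't be decoded; it already is unicode. Finally, ~~u'\xe8\xb7\xb3\xe8' is not actually a valid unicode string~~ [1].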
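To make the bytes-versus-text distinction concrete, here is a minimal sketch. The variable names are illustrative only, and the snippet is written so that it should run on both Python 2.7 and Python 3 (the `b''` and `u''` prefixes are accepted by both).

```python
# Byte string: the three UTF-8 bytes that encode the character U+8DF3.
raw_bytes = b'\xe8\xb7\xb3'

# Decoding turns bytes into text: a single character, code point U+8DF3.
decoded = raw_bytes.decode('utf-8')

# A unicode literal with the same escapes is already text. It holds three
# separate characters (U+00E8, U+00B7, U+00B3), not one CJK character.
literal = u'\xe8\xb7\xb3'

# Inspect the code points of each string.
code_points_decoded = [hex(ord(ch)) for ch in decoded]
code_points_literal = [hex(ord(ch)) for ch in literal]
```

The two strings share the same escape digits but differ in length and content, which is why printing the literal shows accented Latin characters instead of 跳.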

Where does a=u'\xe8\xb7\xb3\xe8' come from? Another function?

[1] See the first comment.

Answer 2

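If you have a string like that, then it is broken. You need to encode it as Latin-1 to get a byte string with the same byte values, and then decode that as UTF-8.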
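A minimal sketch of that round trip follows. `repair_mojibake` is a hypothetical helper name, not something from the standard library. The sample input uses only the first three characters of the question's string, since those form a complete UTF-8 sequence.

```python
def repair_mojibake(text):
    """Hypothetical helper: recover text whose UTF-8 bytes were
    mistakenly stored as individual Latin-1 characters."""
    # Latin-1 maps code points 0-255 to the identical byte values,
    # so this recovers the original bytes unchanged.
    raw = text.encode('latin-1')
    # Interpret those bytes as the UTF-8 they were meant to be.
    return raw.decode('utf-8')

fixed = repair_mojibake(u'\xe8\xb7\xb3')
```

This only works when every character in the broken string is below U+0100; anything higher makes the Latin-1 encode step raise UnicodeEncodeError.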

Answer 3

The unicode string u'\xe8\xb7\xb3\xe8' is equivalent to u'\u00e8\u00b7\u00b3\u00e8'. What you want is u'\u8df3', which is encoded in UTF-8 as '\xe8\xb7\xb3'.
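Both claims can be checked directly. This is a small sketch with illustrative variable names, written to run on Python 2.7 and Python 3 alike.

```python
# In a unicode literal, \xe8 and \u00e8 denote the same code point.
mojibake = u'\xe8\xb7\xb3\xe8'
same_text = u'\u00e8\u00b7\u00b3\u00e8'
are_equal = (mojibake == same_text)

# The intended character is a single code point...
intended = u'\u8df3'
# ...and encoding it to UTF-8 yields the three bytes seen in the question.
utf8_bytes = intended.encode('utf-8')
```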

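In Python 2, a unicode object is stored as UCS-2 or UCS-4, depending on a build option. On a UCS-2 build, u'\xe8\xb7\xb3\xe8' is therefore a string of 4 16-bit Unicode characters.

If your UTF-8 string (an 8-bit string) has mistakenly ended up as Unicode (a 16-bit string), you must first convert it back to an 8-bit string: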

>>> ''.join([chr(ord(a)) for a in u'\xe8\xb7\xb3']).decode('utf8')
u'\u8df3'

Note that '\xe8\xb7\xb3\xe8' is not a valid UTF-8 string, because the last byte '\xe8' is the lead byte of a three-byte sequence and cannot terminate a UTF-8 string.
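A short sketch of what that means in practice: strict decoding rejects the dangling lead byte, while the standard 'ignore' error handler drops it. Whether silently dropping bytes is acceptable depends on where the data came from, so treat this as an illustration rather than a recommendation. Variable names are illustrative.

```python
# Three valid bytes followed by a lead byte with no continuation bytes.
data = b'\xe8\xb7\xb3\xe8'

# Strict decoding (the default) refuses the incomplete trailing sequence.
try:
    data.decode('utf-8')
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# The 'ignore' error handler skips the bytes it cannot decode.
lenient = data.decode('utf-8', 'ignore')
```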

What is the difference between these ways of handling Unicode strings in Python?
