Question

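I have some data in a database that came from user input: "BTS⚾️>BTS🎤", that is, "BTS" + the baseball emoji + ">BTS" + the microphone emoji. When I read it from the database, decode it, and print it in Python 2, the emojis display correctly. But when I try to decode the same bytes in Python 3, it fails with a UnicodeDecodeError.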

The bytes in Python 2:

>>> data
'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'

Decoding them as UTF-8 produces the following unicode string:

>>> 'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'.decode('utf_8')
u'BTS\u26be\ufe0f>BTS\U0001f3a4'

Printing that unicode string on a Mac shows the baseball and microphone emojis:

>>> print u'BTS\u26be\ufe0f>BTS\U0001f3a4'
BTS⚾️>BTS🎤

But in Python 3, decoding the same bytes as UTF-8 gives me an error:

>>> b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'.decode('utf_8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 13: invalid continuation byte

In particular, the last 6 bytes (the microphone emoji) seem to be the problem:

>>> b'\xed\xa0\xbc\xed\xbe\xa4'.decode('utf_8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte

In addition, other tools, such as this online hex-to-Unicode converter, tell me these bytes are not a valid Unicode character:

https://onlineutf8tools.com/convert-bytes-to-utf8?input=ed%20a0%20bc%20ed%20be%20a4

Why do Python 2, and whatever program encoded the user input, treat these bytes as a microphone emoji, while Python 3 and other tools do not?

Answer 1

A couple of pages seem to help answer your question:

https://bugs.python.org/issue9133 (about Python 2's overly permissive UTF-8 handling)
How to work with surrogate pairs in Python? (about dealing with that permissiveness)

If I decode the bytes you got from Python 2 with Python 3's 'surrogatepass' error handler, that is:

b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'.decode('utf_8',
    errors = 'surrogatepass')

then I get the string 'BTS⚾️>BTS\ud83c\udfa4', where '\ud83c\udfa4' is a surrogate pair that is supposed to represent the microphone emoji.
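To make the link between the surrogate pair and the emoji concrete, here is a minimal sketch of the standard UTF-16 surrogate arithmetic. The function name `combine_surrogates` is hypothetical and is used only for illustration; it is not part of the standard library.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    # Each surrogate carries 10 bits of the code point's offset from U+10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = combine_surrogates(0xD83C, 0xDFA4)
print(hex(code_point))                    # 0x1f3a4
print(chr(code_point) == "\U0001f3a4")    # True
```

So the pair '\ud83c\udfa4' corresponds to U+1F3A4, the same code point Python 2 showed as u'\U0001f3a4'.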

You can get the microphone back in Python 3 by encoding the string that contains the surrogate pair as UTF-16 with 'surrogatepass', and then decoding it as UTF-16:

>>> string_as_utf_8 = b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'.decode('utf_8', errors='surrogatepass')
>>> bytes_as_utf_16 = string_as_utf_8.encode('utf_16', errors='surrogatepass')
>>> string_as_utf_16 = bytes_as_utf_16.decode('utf_16')
>>> print(string_as_utf_16)
BTS⚾️>BTS🎤
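The round trip above can be wrapped in a small helper. This is only a sketch under the assumption that the stored bytes are UTF-8 in which characters outside the BMP were written as two 3-byte surrogate sequences (a form often called CESU-8); the name `decode_surrogate_utf8` is made up for this example.

```python
def decode_surrogate_utf8(raw: bytes) -> str:
    """Decode UTF-8 bytes that may contain encoded surrogate pairs."""
    # Step 1: let the encoded surrogates through as lone surrogate code points.
    text = raw.decode("utf-8", errors="surrogatepass")
    # Step 2: a UTF-16 round trip joins each high/low pair into one code point.
    return text.encode("utf-16", errors="surrogatepass").decode("utf-16")


data = b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'
fixed = decode_surrogate_utf8(data)
print(fixed == 'BTS\u26be\ufe0f>BTS\U0001f3a4')  # True
```

Bytes that are already valid UTF-8 pass through unchanged. Input with an unpaired surrogate should raise a UnicodeDecodeError in the final UTF-16 decode, which is probably what you want for data that cannot be repaired.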

Answer 2

Try encoding the string u'BTS\u26be\ufe0f>BTS\U0001f3a4' as UTF-8 again in Python 3:

text = u'BTS\u26be\ufe0f>BTS\U0001f3a4'
result = text.encode('utf_8')
print(result)
result.decode('utf_8')

result contains the following bytes:

b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xf0\x9f\x8e\xa4'

which differ from the bytes seen in Python 2:

b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'

But if you decode result, b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xf0\x9f\x8e\xa4', as UTF-8 again in Python 3, you get the result you want.

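In short, Python 2 and Python 3 handle these bytes differently, so you need to store bytes in the database that decode the same way in both, i.e. standard UTF-8.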
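If the existing rows need to be rewritten, one possible approach is sketched below. It assumes the stored values are either valid UTF-8 or UTF-8 with surrogate pairs encoded as two 3-byte sequences; `normalize_utf8` is a hypothetical helper name, and how you read and write the rows depends on your database driver.

```python
def normalize_utf8(raw: bytes) -> bytes:
    """Return standard UTF-8 bytes for a stored value."""
    try:
        raw.decode("utf-8")          # strict decode: already valid, keep as-is
        return raw
    except UnicodeDecodeError:
        # Let encoded surrogates through, join the pairs via UTF-16,
        # then re-encode as standard UTF-8.
        text = raw.decode("utf-8", errors="surrogatepass")
        text = text.encode("utf-16", errors="surrogatepass").decode("utf-16")
        return text.encode("utf-8")


old = b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'
new = normalize_utf8(old)
print(new == b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xf0\x9f\x8e\xa4')  # True
```

After this conversion, a plain .decode('utf_8') works on the stored value in both Python 2 and Python 3.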
