Question

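I updated a Python 2 package to support Python 3, and I'm stuck on a single test case that fails under Python 3 because of an encoding issue. The package mostly handles URL normalization, and applies some custom transformations before or after offloading work to a few libraries on PyPI.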

In Python 2, I might have two strings that are both encodings of the same URL, for example:

url_a = u'http://➡.ws/♥'
url_b =  'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'

The following holds:

url_a.encode('utf-8') == url_b
>>> True
type(url_a.encode('utf-8')) == str
>>> True

After a bunch of miscellaneous processing steps, both are normalized to punycode:

url_result = 'http://xn--hgi.ws/%E2%99%A5'
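For readers who want to see where that result comes from, here is a minimal Python 3 sketch of the normalization step. It is not the package's actual code, which isn't shown in the question. The helper name normalize_url is hypothetical, and it only assumes the standard library's urllib.parse functions and the built-in 'idna' codec (IDNA 2003). Ports and userinfo are ignored for brevity.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def normalize_url(url):
    """Hypothetical helper: punycode the host, percent-encode the path."""
    parts = urlsplit(url)
    # The built-in 'idna' codec turns a Unicode hostname into its ASCII
    # (xn--...) form. Assumes an absolute URL with a hostname present.
    host = parts.hostname.encode("idna").decode("ascii")
    # quote() percent-encodes the UTF-8 bytes of non-ASCII characters.
    # '%' is kept safe so already-escaped input is not escaped twice.
    path = quote(parts.path, safe="/%")
    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))


print(normalize_url("http://➡.ws/♥"))
```

This only works when the input is real text (a Python 3 str holding the actual characters), which is why the bytes/str question below matters.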

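Under Python 3 I run into trouble, because url_a.encode('utf-8') returns a bytestring, which is also how the variable would have to be declared (with a b prefix) when it is written in this format.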

url_a.encode('utf-8')
>>> b'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'
url_a.encode('utf-8') == url_b
>>> False
type(url_a.encode('utf-8')) == str
>>> False
type(url_a.encode('utf-8')) == bytes
>>> True

I can't figure out a way to operate on url_b so that it gets encoded/decoded the way I need.

I could define my test cases with bytestring declarations, and everything would pass in both environments...

url_a = u'http://➡.ws/♥'
url_b = b'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'

However, data that hasn't been processed yet in the message queues or the database could still break production.

Essentially, in Python 3 I need to detect that a short string such as

url_b = 'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'

should have been declared as a bytestring

url_b = b'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'

and convert it correctly, because as written it is interpreted as

url_b
>>> 'http://â\x9e¡.ws/â\x99¥'
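To make the cause concrete: in a Python 3 str literal, each \xNN escape is a code point (U+00E2, U+009E, ...), not a byte. The resulting string is exactly what you get by decoding the UTF-8 bytes as Latin-1. A small sketch of that, using only built-in codecs (the variable names are just for illustration):

```python
# The real UTF-8 bytes of the URL.
raw = b"http://\xe2\x9e\xa1.ws/\xe2\x99\xa5"

# The same escapes in a str literal: each \xNN is a code point.
url_b = "http://\xe2\x9e\xa1.ws/\xe2\x99\xa5"

# Decoding the bytes as Latin-1 maps every byte to the code point
# with the same number, which reproduces url_b exactly.
mojibake = raw.decode("latin-1")
print(mojibake == url_b)                    # True
print([hex(ord(c)) for c in url_b[7:10]])   # ['0xe2', '0x9e', '0xa1']
```

Because the mapping is one-to-one for code points below 256, it can be reversed without loss.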

edit: The closest I've gotten is url_b.decode('unicode-escape'), which produces b'http://\\xe2\\x9e\\xa1.ws/\\xe2\\x99\\xa5'

Answer 1

You want .encode(), not .decode(), and the 'raw_unicode_escape' codec:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

url_a = u'http://➡.ws/♥'
url_b =  'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'

encoded_a = url_a.encode('utf-8')
try:
    # Python 3
    encoded_b = url_b.encode('raw_unicode_escape')
except UnicodeDecodeError:
    # Python 2
    encoded_b = url_b

print(repr(encoded_a))
print(repr(encoded_b))

# Output is as follows (without the leading 'b' in Python 2):
#   b'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'
#   b'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'
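If the incoming data may be bytes, real text, or the mis-declared str form, one possible way to handle the "detect" part of the question is a helper along these lines. This is a sketch, not a definitive implementation: to_utf8_bytes is a hypothetical name, and it uses 'latin-1' rather than 'raw_unicode_escape'. The two codecs agree for code points below 256, but 'latin-1' raises on anything higher instead of emitting \uXXXX escapes, which is what makes the detection possible.

```python
def to_utf8_bytes(value):
    """Hypothetical helper: return UTF-8 bytes for any of the three forms."""
    if isinstance(value, bytes):
        # Already bytes: assume they are UTF-8 and pass them through.
        return value
    try:
        # If every code point fits in one byte AND those bytes form
        # valid UTF-8, treat the string as mis-declared bytes.
        candidate = value.encode("latin-1")
        candidate.decode("utf-8")
    except UnicodeError:
        # Either real text with code points above 255 (encode failed)
        # or genuine Latin-1 text (decode failed): encode normally.
        return value.encode("utf-8")
    return candidate


print(to_utf8_bytes("http://\xe2\x9e\xa1.ws/\xe2\x99\xa5"))
print(to_utf8_bytes("http://\u27a1.ws/\u2665"))
```

Both calls print b'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'. The heuristic can misfire on legitimate text that happens to look like UTF-8 read as Latin-1 (for example the two-character string 'Ã©'), so it is only reasonable when such input is not expected.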

Answer 2

Code:

url_b = b'http://\xe2\x9e\xa1.ws/\xe2\x99\xa5'
print(url_b.decode("utf-8"))

Output:

http://➡.ws/♥
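Note that this covers the case where url_b is already a bytestring. For the str form from the question, one extra step is needed to get back to bytes before decoding. A short Python 3 sketch, assuming the str was produced from UTF-8 bytes in the way the question describes:

```python
url_a = "http://\u27a1.ws/\u2665"          # the real text
url_b_bytes = b"http://\xe2\x9e\xa1.ws/\xe2\x99\xa5"
url_b_str = "http://\xe2\x9e\xa1.ws/\xe2\x99\xa5"

# Bytes: decode directly.
print(url_b_bytes.decode("utf-8") == url_a)   # True

# Mis-declared str: go back to bytes via Latin-1, then decode.
repaired = url_b_str.encode("latin-1").decode("utf-8")
print(repaired == url_a)                      # True
```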

Python 2 to Python 3 unicode and bytes migration issue
