Question

One of the library files in Python 3.3, email/utils.py, contains the following code:

_has_surrogates = re.compile(
'([^\ud800-\udbff]|\A)[\udc00-\udfff]([^\udc00-\udfff]|\Z)').search

This code is compiled into Python bytecode.

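In my cross-platform disassembler xdis, which the Python decompiler uncompyle6 depends on, I want to produce a similar string from the byte array that is read. Using

unicodestring = fp.read(strsize)

the byte string I get is:

b'([^\xed\xa0\x80-\xed\xaf\xbf]|\\A)[\xed\xb0\x80-\xed\xbf\xbf]([^\xed\xb0\x80-\xed\xbf\xbf]|\\Z)'

If I try

unicodestring.decode('utf-8')

I get:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 3: invalid continuation byte

I know about adding , errors='ignore' to the call, but I want to handle these bytes rather than drop them. I can switch from utf-8 to latin-1, but the string I get is:

'([^í\xa0\x80-í¯¿]|\\A)[í°\x80-í¿¿]([^í°\x80-í¿¿]|\\Z)'

which I am not sure is correct either. Even if it is correct, wouldn't the Python program need some other marker to indicate that it contains latin-1 strings?

Perhaps there is a solution using

unicodestring.decode('utf-8', 'surrogateescape')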

But in that case, I guess I would need to post-process the string to remove the "surrogates", right?
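
To make the comparison concrete, here is a minimal sketch that applies the decoding options to the byte string above. The surrogatepass handler is not something discussed above. It is included on the assumption that the bytes are UTF-8 with lone surrogates encoded directly, which is what the three-byte \xed sequences look like. The names raw, escaped, passed, and latin are illustrative only.

```python
# Byte string as read from the bytecode file (copied from the question).
raw = (b'([^\xed\xa0\x80-\xed\xaf\xbf]|\\A)'
       b'[\xed\xb0\x80-\xed\xbf\xbf]'
       b'([^\xed\xb0\x80-\xed\xbf\xbf]|\\Z)')

# Strict UTF-8 rejects encoded surrogates, as reported above.
try:
    raw.decode('utf-8')
except UnicodeDecodeError as exc:
    print('strict:', exc)

# latin-1 never fails, but maps every byte to one character,
# so each three-byte sequence becomes three unrelated characters.
latin = raw.decode('latin-1')

# surrogateescape smuggles each undecodable byte through as a lone
# low surrogate; it round-trips, but is not the original text.
escaped = raw.decode('utf-8', 'surrogateescape')

# surrogatepass (assumption: bytes are UTF-8 with surrogates allowed)
# decodes the three-byte sequences to the surrogate code points.
passed = raw.decode('utf-8', 'surrogatepass')

# ascii() is used because printing lone surrogates directly can fail
# on a UTF-8 terminal.
print('latin-1:        ', ascii(latin))
print('surrogateescape:', ascii(escaped))
print('surrogatepass:  ', ascii(passed))
```

If that assumption holds, passed compares equal to the pattern string in email/utils.py, and both surrogate handlers re-encode to the original bytes when the same handler is given to encode().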

How to decode unicode string escapes in Python 3.6

0 answers: