Question

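In Python 3.5+, .decode("utf-8", "backslashreplace") is a pretty good option for dealing with binary strings that are partly Unicode and partly some unknown legacy encoding. Valid UTF-8 sequences are decoded, and invalid ones are preserved as escape sequences. For instance: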

>>> print(b'\xc2\xa1\xa1'.decode("utf-8", "backslashreplace"))
¡\xa1

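This loses the distinction between b'\xc2\xa1\xa1' and b'\xc2\xa1\\xa1', but if you're in the "just get me something not too lossy that I can fix up by hand later" frame of mind, that's probably OK.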
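To make the lost distinction concrete, here is a small check. It only runs on Python 3.5+, since it relies on the built-in handler accepting decode errors, and the variable names are purely illustrative.

```python
# Requires Python 3.5+, where "backslashreplace" also works for decoding.
invalid_byte = b'\xc2\xa1\xa1'      # UTF-8 for U+00A1, then a stray 0xA1 byte
literal_escape = b'\xc2\xa1\\xa1'   # UTF-8 for U+00A1, then the four ASCII characters \xa1

decoded_invalid = invalid_byte.decode("utf-8", "backslashreplace")
decoded_literal = literal_escape.decode("utf-8", "backslashreplace")

# Two different byte strings decode to the same text, so the original
# bytes cannot be recovered unambiguously from the result.
assert decoded_invalid == decoded_literal
```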

However, this is a new feature in Python 3.5. The program I'm working on also needs to support 3.4 and 2.7. In those versions, it raises an exception:

>>> print(b'\xc2\xa1\xa1'.decode("utf-8", "backslashreplace"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
TypeError: don't know how to handle UnicodeDecodeError in error callback

I have found an approximation, but not an exact equivalent:

>>> print(b'\xc2\xa1\xa1'.decode("latin1")
...       .encode("ascii", "backslashreplace").decode("ascii"))
\xc2\xa1\xa1

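It is very important that the behavior not depend on the interpreter version. Can anyone suggest a way to get exactly the Python 3.5 behavior in 2.7 and 3.4?

(Older versions of either 2.x or 3.x do not need to work. Monkey patching codecs is totally acceptable.)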

Answer 1

I attempted a more complete backport of the cpython implementation. It handles UnicodeDecodeError (from .decode()) as well as UnicodeEncodeError (from .encode()) and UnicodeTranslateError.
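As a rough idea of what a handler covering all three exception types could look like, here is a minimal sketch. It is not the code from this answer: the handler name `backslashreplace_backport` is hypothetical, and the escape formats (\xNN, \uNNNN, \UNNNNNNNN) are my reading of what the built-in handler produces.

```python
import codecs


def backslashreplace_backport(ex):
    """Hypothetical stand-in for the 3.5+ 'backslashreplace' error handler."""
    if isinstance(ex, UnicodeDecodeError):
        # ex.object is a byte string; wrapping the bad slice in bytearray
        # yields ints when iterated on both 2.7 and 3.x.
        bad = bytearray(ex.object[ex.start:ex.end])
        return u''.join(u'\\x{:02x}'.format(b) for b in bad), ex.end

    if isinstance(ex, (UnicodeEncodeError, UnicodeTranslateError)):
        parts = []
        for ch in ex.object[ex.start:ex.end]:
            cp = ord(ch)
            # Caveat: on narrow 2.7 builds a non-BMP character arrives as
            # two surrogates, so it is escaped as two \uNNNN sequences.
            if cp <= 0xFF:
                parts.append(u'\\x{:02x}'.format(cp))
            elif cp <= 0xFFFF:
                parts.append(u'\\u{:04x}'.format(cp))
            else:
                parts.append(u'\\U{:08x}'.format(cp))
        return u''.join(parts), ex.end

    # Anything else is not a Unicode error this handler understands.
    raise ex


codecs.register_error('backslashreplace_backport', backslashreplace_backport)

decoded = b'\xc2\xa1\xa1ABC'.decode('utf-8', 'backslashreplace_backport')
encoded = u'\xa1\u20ac'.encode('ascii', 'backslashreplace_backport')
```

Registering it under a new name, rather than replacing the built-in 'backslashreplace', keeps the behavior identical on every interpreter version, which is what the question asks for.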

Answer 2

You can write your own error handler. Here's a solution that I tested on Python 2.7, 3.3 and 3.6:

from __future__ import print_function
import codecs
import sys

print(sys.version)

def myreplace(ex):
    # The error handler receives the UnicodeDecodeError, which contains arguments of the
    # string and start/end indexes of the bad portion.
    bstr,start,end = ex.object,ex.start,ex.end

    # The return value is a tuple of Unicode string and the index to continue conversion.
    # Note: iterating byte strings returns int on 3.x but str on 2.x
    return u''.join('\\x{:02x}'.format(c if isinstance(c,int) else ord(c))
                    for c in bstr[start:end]),end

codecs.register_error('myreplace',myreplace)
print(b'\xc2\xa1\xa1ABC'.decode("utf-8", "myreplace"))

Output:

C:\>py -2.7 test.py
2.7.13 (v2.7.13:a06454b1afa1, Dec 17 2016, 20:42:59) [MSC v.1500 32 bit (Intel)]
¡\xa1ABC

C:\>py -3.3 test.py
3.3.5 (v3.3.5:62cf4e77f785, Mar  9 2014, 10:35:05) [MSC v.1600 64 bit (AMD64)]
¡\xa1ABC

C:\>py -3.6 test.py
3.6.1 (v3.6.1:69c0db5, Mar 21 2017, 18:41:36) [MSC v.1900 64 bit (AMD64)]
¡\xa1ABC
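Since the question requires the result not to depend on the interpreter version, one way to gain some confidence is to compare the custom handler against the built-in one wherever the built-in supports decoding. The following is a sketch only: the handler is restated in a slightly more compact form, and the sample inputs are arbitrary choices of mine.

```python
import codecs
import sys


def myreplace(ex):
    # Same idea as the handler above; bytearray yields ints on 2.x and 3.x.
    bad = bytearray(ex.object[ex.start:ex.end])
    return u''.join(u'\\x{:02x}'.format(b) for b in bad), ex.end


codecs.register_error('myreplace', myreplace)

# Arbitrary sample inputs, chosen for illustration.
samples = [
    b'\xc2\xa1\xa1ABC',  # valid sequence followed by a stray continuation byte
    b'\xff\xfe',         # bytes that never occur in UTF-8
    b'\xe2\x82',         # truncated three-byte sequence
    b'plain ascii',      # nothing to escape
]

results = [raw.decode('utf-8', 'myreplace') for raw in samples]

# On 3.5+ the built-in handler can decode, so the two must agree there.
if sys.version_info >= (3, 5):
    builtin = [raw.decode('utf-8', 'backslashreplace') for raw in samples]
    assert results == builtin
```

Running the same script under 2.7 and 3.4 and comparing `results` across interpreters would cover the versions where the built-in comparison is skipped.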

Exact equivalent of b'...'.decode("utf-8", "backslashreplace") in Python 2
