Question

我正在尝试使用python3 unicode_escape在字符串中转义\ n，但是挑战在于整个字符串中都存在非ascii字符，如果我使用utf8进行编码，然后使用unicode_escape解码字节，则特殊字符出现乱码。有什么方法可以使\ n换行而又不占用特殊字符？

s = "hello\\nworld└--"
print(s.encode('utf8').decode('unicode_escape'))

Expected Result:
hello
world└--

Actual Result:
hello
worldâ--

Answer 1

正如用户wowcha所观察到的那样，unicode-escape编解码器采用latin-1编码，但是您的字符串包含不可编码为latin-1的字符。

>>> s = "hello\\nworld└--"
>>> s.encode('latin-1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2514' in position 12: ordinal not in range(256)

将字符串编码为utf-8可以解决编码问题，但是从unicode-escape解码时会导致mojibake

解决方案是在编码时使用backslashreplace错误处理程序。这样会将问题字符转换为转义序列，该转义序列可以编码为latin-1，并且从unicode-escape进行解码时不会被弄乱。

>>> s.encode('latin-1', errors='backslashreplace')
b'hello\\nworld\\u2514--'

>>> s.encode('latin-1', errors='backslashreplace').decode('unicode-escape')
'hello\nworld└--'

>>> print(s.encode('latin-1', errors='backslashreplace').decode('unicode-escape'))
hello
world└--

Answer 2

尝试删除第二个转义反斜杠并使用utf8进行解码：

>>> s = "hello\nworld└--"
>>> print(s.encode('utf8').decode('utf8'))
hello
world└--

Answer 3

我相信您遇到的问题是unicode_escape在Python 3.3中已被弃用，并且似乎假设您的代码为'latin-1'，因为这是{{1}中使用的原始编解码器}功能...

从the python documentation for codecs看，我们看到unicode_excape告诉我们Encoding suitable as the contents of a Unicode literal in ASCII-encoded Python source code, except that quotes are not escaped. Decodes from Latin-1 source code. Beware that Python source code actually uses UTF-8 by default.假设您的文字是ISO Latin-1。因此，如果我们使用latin1编码运行您的代码，则会收到此错误：

unicode_escape

Unicode字符错误是s.encode('latin1') Traceback (most recent call last): File "<stdin>", line 1, in <module> UnicodeEncodeError: 'latin-1' codec can't encode character '\u2514' in position 12: ordinal not in range(256)，转换后的错误是'\u2514'，最简单的放置方法是不能在Latin-1字符串中使用该字符，因此为什么要使用其他字符。 / p>

我也认为指出您的字符串中有'└'而不只是'\\n'是正确的，额外的反斜杠表示此符号不是回车符，而是被忽略，反斜杠表示忽略'\n'。也许尝试不使用'\n' ...

python 3中具有非ASCII字符的编码/解码问题

3 个答案: