Question

I have the following function:

import urllib.request

def seek():
    web = urllib.request.urlopen("http://wecloudforyou.com/")
    text = web.read().decode("utf8")
    return text
texto = seek()
print(texto)

When I decode as UTF-8, I get the HTML code with indentation and line breaks, just as it appears on the actual website.

<!DOCTYPE html>
<html>
    <head>
       <title>We Cloud for You |

If I remove .decode('utf8'), I still get the code, but the indentation is gone and it is replaced by \n.

<!DOCTYPE html>\n<html>\n    <head>\n       <title>We Cloud for You

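So why is this happening? As far as I know, when you decode, you are basically converting an encoded string to Unicode.

My sys.stdout.encoding is CP1252 (the Windows-1252 encoding).

According to this thread: Why does Python print unicode characters when the default encoding is ASCII?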

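- Python outputs non-unicode strings as raw data, without considering its default encoding. The terminal just happens to display them correctly if its current encoding matches the data.
- Python outputs Unicode strings after encoding them using the scheme specified in sys.stdout.encoding.
- Python gets that setting from the shell's environment.
- The terminal displays output according to its own encoding settings.
- The terminal's encoding is independent of the shell's.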

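So it seems Python needs to read the text as Unicode before it can convert it to CP1252 and print it on the terminal. But I don't understand why, if the text is not decoded, the indentation gets replaced by \n.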
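The encoding step described in the quoted points can be sketched roughly as follows. This is only an illustration, not code from the linked thread: the sample string is made up, and CP1252 is hard-coded on the assumption that it matches the sys.stdout.encoding reported above.

```python
# Made-up sample text containing a real newline and a non-ASCII character.
text = "<title>We Cloud for You</title>\n    caf\u00e9"

# Printing a str is roughly "encode with the stdout encoding, then write
# the resulting bytes". CP1252 is assumed here to match the console above.
encoded = text.encode("cp1252")

# The newline is a single byte (0x0A) in the encoded data, not the two
# characters backslash and n.
print(encoded.count(b"\n"))

# Decoding with the same codec recovers the original text.
round_trip = encoded.decode("cp1252")
print(round_trip == text)
```

Note that this only covers what happens to a str. A bytes object passed to print() does not go through this path as text, which is what the answer below addresses.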

sys.getdefaultencoding() returns utf8.

Answer 1

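In Python 3, when you pass a bytes value (the raw bytes from the network, without decoding) to print(), you see the representation of that value as a Python bytes literal. This includes representing newlines as \n escape sequences.

By decoding, you get a Unicode string (str) value instead, which print() can handle directly: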

>>> print(b'Newline\nAnother line')
b'Newline\nAnother line'
>>> print(b'Newline\nAnother line'.decode('utf8'))
Newline
Another line

This is perfectly normal behavior.
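To make the mechanism a little more concrete, here is a minimal sketch. The sample bytes are made up and simply stand in for the result of web.read(). print() calls str() on its argument, and for a bytes object str() returns the same text as repr(), that is, the bytes literal with escapes.

```python
# Made-up stand-in for the raw bytes returned by web.read().
raw = b"<html>\n    <head>"

# For bytes, str() and repr() give the same text: the literal form,
# with the newline shown as the two characters backslash and n.
shown = str(raw)
print(shown == repr(raw))
print(shown)

# After decoding, the newline is a real line break again.
text = raw.decode("utf8")
print(text)

# The literal form is a single line; the decoded text spans two lines.
print(len(shown.splitlines()), len(text.splitlines()))
```

The spaces used for indentation are still present in the bytes output. Only the line breaks are shown as escapes, so everything ends up on one line.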

Unexpected Python behavior when I don't decode to UTF-8