Question

I am using a Jupyter Notebook (Anaconda, Python 3.7) with the requests module to scrape some video game data from a website.

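The game "Brütal Legend" has an umlaut in its name and is displayed correctly on the website I am scraping, but when I fetch the data with the requests module, the special character no longer comes through. For example, this is what I get: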

BrÃ¼tal Legend

Here is my code:

import requests

targetURL = 'https://www.url.com/redacted.php?query'
r = requests.get(targetURL)
page_source = r.text
print("raw page_source", page_source)

What can I do to preserve the special characters so that they display correctly in the Jupyter Notebook output?

Answer 1

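Even though most websites use UTF-8, you need to know the charset declared in the response's Content-Type header. r.text decodes r.content using r.encoding, which requests takes from that header. If a text/* response declares no charset, requests falls back to ISO-8859-1; if r.encoding is None, requests guesses the encoding from the body instead.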

Note: many websites do not declare a charset, but they are probably using UTF-8.

http://docs.python-requests.org/en/master/api/?highlight=encod#requests.Response.encoding
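
As a rough sketch of the "find the charset in Content-Type" step, the helper below reads the charset parameter from a header value using only the standard library, and falls back to UTF-8 when none is declared. The name charset_from_content_type and the UTF-8 default are assumptions made for illustration; this is not how requests itself chooses an encoding.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type value, or `default`.

    Hypothetical helper: the UTF-8 fallback is an assumption that
    matches most modern sites, not something the server guarantees.
    """
    if not content_type:
        return default
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lowercased charset, or None if absent.
    return msg.get_content_charset() or default


print(charset_from_content_type("text/html; charset=ISO-8859-1"))
print(charset_from_content_type("text/html"))
```

With a live response, one way to use it would be r.encoding = charset_from_content_type(r.headers.get("Content-Type")) before reading r.text.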

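So the reason you get BrÃ¼tal Legend is that the bytes were converted to a string with the wrong encoding: the page's UTF-8 bytes were decoded as ISO-8859-1. You should try r.content.decode("utf-8"), or set r.encoding = "utf-8" before reading r.text.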
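
The mismatch can be reproduced without any network access. This is a minimal sketch that assumes the server sends the title as UTF-8 bytes; the variable names are illustrative only.

```python
# Bytes as the server would send them (assumed to be UTF-8 encoded).
raw = "Brütal Legend".encode("utf-8")

# Decoding with the wrong codec turns the two-byte "ü" into two characters.
wrong = raw.decode("ISO-8859-1")

# Decoding with the codec that matches the bytes restores the original text.
right = raw.decode("utf-8")

print(wrong)
print(right)
```

In UTF-8 the character ü is the byte pair C3 BC, and ISO-8859-1 maps each of those bytes to its own character, which is where the extra "Ã¼" comes from.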

A simple example:

import requests
with requests.Session() as s:
    utf_8 = s.get("https://en.wikipedia.org/wiki/Br%C3%BCtal_Legend")
    #response charset is UTF8
    print(utf_8.text[101:126])
    print(utf_8.content.decode("utf8")[101:126])

    print(utf_8.content[101:127].decode("ISO-8859-1"))

Output:

Brütal Legend - Wikipedia
Brütal Legend - Wikipedia
BrÃ¼tal Legend - Wikipedia

Edit:

print("BrÃ¼tal Legend".encode("ISO-8859-1").decode())
#Brütal Legend
