Question

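I'm by no means an experienced coder, so I apologize in advance.

I often use BeautifulSoup and the like for simple web scraping and then move on. Recently, on certain websites, I've run into a problem that I can't seem to search for or figure out on my own.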

    r = requests.get('https://www.sneakersnstuff.com/', headers=headers)
    print(r.text)

When printed, the output doesn't look the way it usually does; it looks like this. Thanks in advance!

Edit: r.content doesn't work either. It's just a bunch of '\x83\xff\x7f\x8cH\xcd\xea\' etc.

Headers:

    headers = {
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'accept-encoding': 'gzip, deflate, br',
        'accept-language': 'en-US,en;q=0.9,ko-KR;q=0.8,ko;q=0.7',
        'cache-control': 'max-age=0',
        'referer': 'https://www.sneakersnstuff.com/en/858/new-arrivals',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36',
    }

Answer 1

Remove the 'accept-encoding' header. It looks like what you're seeing is the compressed response body.
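A minimal sketch of that suggestion, under some assumptions: the question's headers advertise `br` (Brotli), and as far as I know requests only decodes Brotli transparently when a Brotli package is installed, so the body may arrive still compressed. The helper names `drop_accept_encoding` and `fetch` are hypothetical, and the header values are shortened for illustration.

```python
def drop_accept_encoding(headers):
    """Return a copy of headers without any Accept-Encoding entry (case-insensitive)."""
    return {k: v for k, v in headers.items() if k.lower() != "accept-encoding"}


# Shortened stand-in for the headers from the question.
headers = {
    "accept-encoding": "gzip, deflate, br",
    "accept-language": "en-US,en;q=0.9",
}

clean_headers = drop_accept_encoding(headers)
print(clean_headers)


def fetch(url, headers):
    """Hypothetical helper: request url without a hand-written Accept-Encoding."""
    import requests  # third-party; imported here so the helper above runs without it

    # With no explicit Accept-Encoding, requests sends its own default value.
    return requests.get(url, headers=drop_accept_encoding(headers), timeout=10)
```

Calling `fetch('https://www.sneakersnstuff.com/', headers).text` should then give readable text if compression was the cause. Installing a Brotli decoder instead of dropping the header may also work, but that is an assumption about the environment, not something the answer states.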

Answer 2

You should read more about Unicode.

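This will solve your problem for now, but it is not the right approach. Once you've read more about Unicode, you'll understand why the following solution won't always work.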

    r = requests.get('https://www.sneakersnstuff.com/', headers=headers)
    print(r.text.encode('ascii', 'ignore').decode('ascii'))
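To illustrate why this is only a stopgap, here is a small sketch using a made-up sample string (not output from the site): encoding to ASCII with `'ignore'` silently discards every non-ASCII character instead of fixing the decoding.

```python
# Made-up sample text containing accented and CJK characters.
text = "Café Zürich 日本語 sneakers"

# Same trick as in the answer: anything outside ASCII is dropped without warning.
ascii_only = text.encode("ascii", "ignore").decode("ascii")

print(repr(text))
print(repr(ascii_only))
print(len(text) - len(ascii_only), "characters were lost")
```

The result prints without errors, but part of the content is gone, so any page that relies on non-ASCII text ends up corrupted.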

Answer 3

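From Response.text's documentation:

Content of the response, in unicode.

If Response.encoding is None, encoding will be guessed using chardet.

The encoding of the response content is determined based solely on HTTP headers, following RFC 2616 to the letter. If you can take advantage of non-HTTP knowledge to make a better guess at the encoding, you should set r.encoding appropriately before accessing this property.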

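In other words, because the headers lack this information, Response.text guessed the encoding of the page content incorrectly.

You need to specify the content's encoding with:

    r.encoding = 'utf-16' # or whatever the encoding of the content really is

before accessing r.text.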
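As a rough sketch of the mechanism this answer describes, the snippet below decodes the same bytes with a wrong and a right encoding, which is essentially what r.text does with r.encoding. The sample string and the `fetch_text` helper are hypothetical. The ISO-8859-1 fallback is my understanding of what requests uses for text content types without a charset.

```python
# Made-up sample text; the bytes stand in for r.content.
original = "Nyheter på sneakers"
raw = original.encode("utf-8")

# Decoding with the wrong encoding produces mojibake, not an error.
wrong = raw.decode("iso-8859-1")
# Decoding with the encoding the content really uses recovers the text.
right = raw.decode("utf-8")

print(repr(wrong))
print(repr(right))


def fetch_text(url, encoding):
    """Hypothetical helper: fetch url and decode the body with a chosen encoding."""
    import requests  # third-party; imported here so the demo above runs without it

    r = requests.get(url, timeout=10)
    r.encoding = encoding  # must be set before r.text is read
    return r.text
```

If the real encoding is unknown, `r.apparent_encoding` gives the library's own guess from the bytes, which can be assigned to `r.encoding` in the same way.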

Get request response text prints invalid characters

3 Answers: