Question

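I want to (gently) crawl a website and download every HTML page I crawl. I already have my list of URLs to crawl. I first tried fetching them with urllib.open, but without a user agent I get an error message. So I switched to the requests library, but I don't really know how to use it.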

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1'
}
page = requests.get('http://www.xf.com/ranking/get/?Amount=1&From=left&To=right', headers=headers)
with open('pages/test.html', 'w') as outfile:
    outfile.write(page.text)

The problem is that when the script tries to write the response to my file, I get an encoding error:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 6673-6675: ordinal not in range(128)

How can I write the response to a file without running into encoding problems?

Answer 1

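In Python 2, file objects returned by open() do not accept Unicode strings directly; writing one triggers an implicit encode with the ASCII codec, which is what fails here. Use response.content to access the raw, undecoded bytes instead: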

# 'wb' writes the bytes exactly as received, with no newline translation
with open('pages/test.html', 'wb') as outfile:
    outfile.write(page.content)

This writes the downloaded HTML in the original encoding as served by the website.

Alternatively, if you want to re-encode all responses to one specific encoding, use io.open() to get a file object that does accept Unicode:

import io

with io.open('pages/test.html', 'w', encoding='utf8') as outfile:
    outfile.write(page.text)

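Note that many websites rely on signalling the correct codec inside the HTML markup itself (for example in a <meta> tag), and may serve their content without any charset parameter in the Content-Type header.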

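In that case, requests falls back to the default codec for text/* mimetypes, Latin-1, to decode the HTML to Unicode text. This is often the wrong codec, and relying on this behaviour can lead to Mojibake in your output later on. I recommend you stick to writing the binary content and rely on tools like BeautifulSoup to detect the correct encoding later.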
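
To illustrate what "signalling the codec in the markup" means, here is a minimal sketch that looks for a charset declaration near the start of the raw bytes. The function name sniff_meta_charset and the 2048-byte scan window are assumptions made for this example; the regex is a rough approximation and not how BeautifulSoup detects encodings internally.

```python
import re

# Matches both <meta charset="..."> and
# <meta http-equiv="Content-Type" content="text/html; charset=...">
META_CHARSET = re.compile(
    br'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)',
    re.IGNORECASE,
)


def sniff_meta_charset(raw_html, scan_bytes=2048):
    """Return the charset named in a <meta> tag, or None if none is found.

    Hypothetical helper: only the first scan_bytes bytes are inspected.
    """
    match = META_CHARSET.search(raw_html[:scan_bytes])
    return match.group(1).decode('ascii') if match else None
```

You could call it as sniff_meta_charset(page.content) on the bytes you saved; a result of None only means this simple pattern found nothing, not that the page has no encoding.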

Alternatively, test explicitly for the charset parameter being present, and only re-encode (via response.text and io.open(), or otherwise) if requests did not fall back to the Latin-1 default. See "retrieve links from web page using python and BeautifulSoup" for an answer where I use this approach to tell BeautifulSoup which codec to use.
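
A minimal sketch of that check, assuming the response object exposes headers, text and content the way a requests response does. The names declared_charset and save_page are made up for this example, and the header parsing is deliberately simple.

```python
import io


def declared_charset(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    if not content_type:
        return None
    # Skip the media type itself and look at the ';'-separated parameters.
    for part in content_type.split(';')[1:]:
        name, _, value = part.strip().partition('=')
        if name.strip().lower() == 'charset' and value:
            return value.strip().strip('"\'')
    return None


def save_page(response, path):
    """Re-encode to UTF-8 only when the server declared a charset.

    Otherwise write the raw bytes untouched, so a later tool can still
    detect the real encoding. Returns the declared charset (or None).
    """
    charset = declared_charset(response.headers.get('content-type'))
    if charset:
        # The decoded text is trustworthy: it came from a declared codec.
        with io.open(path, 'w', encoding='utf8') as outfile:
            outfile.write(response.text)
    else:
        # No declaration: the decoded text may be a Latin-1 guess.
        with open(path, 'wb') as outfile:
            outfile.write(response.content)
    return charset
```

With the code from the question this would be called as save_page(page, 'pages/test.html').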

Answer 2

outfile.write(page.text.encode('utf8', 'replace'))

I found the documentation here: unicode problem
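
As a small illustration of what the 'replace' error handler does (the sample string below is made up for this example): with a codec that cannot represent a character, such as ASCII, the character is replaced by '?', whereas UTF-8 can encode these characters as they are, so nothing is lost.

```python
# Hypothetical sample text: one accented Latin letter and two CJK characters.
text = u'caf\xe9 \u4e2d\u6587'

# ASCII cannot represent the non-ASCII characters, so 'replace' swaps in '?'.
ascii_bytes = text.encode('ascii', 'replace')

# UTF-8 can represent them, so the error handler is never triggered here.
utf8_bytes = text.encode('utf8', 'replace')

print(ascii_bytes)
print(utf8_bytes.decode('utf8') == text)
```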
