Question

This is my code:

dataFile = open('dataFile.html', 'w')
res = requests.get('site/pm=' + str(i))
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
linkElems = soup.select('#content')
dataFile.write(str(linkElems[0]))

I have some other code as well, but this is the part I think is causing the problem. I have also tried using:

dataFile.write(str(linkElems[0].decode('utf-8')))

But that does not work and gives an error.

Using dataFile = open('dataFile.html', 'wb') gives the error:

a bytes-like object is required, not 'str'

Answer 1

You opened your text file without specifying an encoding:

dataFile = open('dataFile.html', 'w')

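This tells Python to use the default codec for your system. Every Unicode string you try to write to the file is encoded with that codec, and your Windows system is not set up with UTF-8 as the default.

Specify the encoding explicitly: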

dataFile = open('dataFile.html', 'w', encoding='utf8')
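As a rough illustration of the difference (a sketch, not part of the original code): the snippet below assumes cp1252 as a stand-in for a typical Windows default codec, and uses a temporary directory and a made-up sample string.

```python
import os
import tempfile

# Sample text with a character (U+2603 SNOWMAN) that cp1252 cannot represent.
text = 'snowman: \u2603'

# cp1252 is an assumption here, standing in for a typical Windows default codec.
try:
    text.encode('cp1252')
    legacy_ok = True
except UnicodeEncodeError:
    # This is the "'charmap' codec can't encode character" failure.
    legacy_ok = False

# With an explicit UTF-8 encoding the same text is written and read back intact.
path = os.path.join(tempfile.mkdtemp(), 'dataFile.html')
with open(path, 'w', encoding='utf-8') as f:
    f.write(text)
with open(path, encoding='utf-8') as f:
    round_trip = f.read()
```

Using a with block also closes the file reliably, which the original snippet does not do.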

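Next, you are trusting the HTTP server to know what encoding the HTML data uses. Often this is not set at all, so don't use response.text! It is not BeautifulSoup that is at fault here; you are re-encoding Mojibake. When the server does not explicitly specify an encoding, the requests library defaults to Latin-1 for text/* content types, because the HTTP standard states that this is the default.

See the Encoding section of the Advanced documentation:

The only time Requests will not do this is if no explicit charset is present in the HTTP headers and the Content-Type header contains text. In this situation, RFC 2616 specifies that the default charset must be ISO-8859-1. Requests follows the specification in this case. If you require a different encoding, you can manually set the Response.encoding property, or use the raw Response.content.

Bold emphasis mine.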
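To make the Mojibake concrete, here is a small sketch that involves no HTTP at all. The sample word is made up; it only shows what happens when UTF-8 bytes are decoded with the Latin-1 fallback.

```python
# UTF-8 bytes as a server might send them, for the sample text 'café'.
raw = 'caf\u00e9'.encode('utf-8')

# Decoding as Latin-1 never raises, because every byte maps to a character,
# but each multi-byte UTF-8 sequence becomes several wrong characters.
wrong = raw.decode('latin-1')

# Decoding the same bytes as UTF-8 recovers the original text.
right = raw.decode('utf-8')

print(repr(wrong))
print(repr(right))
```

The first result is one character longer than the second, because the two bytes that encode the accented letter were each turned into a separate character.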

Pass in the raw response.content data instead:

soup = bs4.BeautifulSoup(res.content, 'html.parser')

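BeautifulSoup 4 usually does a good job of working out the right encoding to use when parsing, either from an HTML <meta> tag or from statistical analysis of the bytes provided. If the server does provide a character set, you can still pass it from the response into BeautifulSoup, but first test whether requests merely used its default: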

encoding = res.encoding if 'charset' in res.headers.get('content-type', '').lower() else None
soup = bs4.BeautifulSoup(res.content, 'html.parser', from_encoding=encoding)
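The header check can also be wrapped in a small helper so it is easy to try against different header values. This is only a sketch: explicit_charset is a hypothetical name, not part of requests or BeautifulSoup, and the header strings are made-up examples.

```python
def explicit_charset(content_type, response_encoding):
    """Return response_encoding only if the Content-Type header names a charset.

    Hypothetical helper for illustration; not part of requests or bs4.
    """
    if 'charset' in (content_type or '').lower():
        return response_encoding
    return None

# The server declared a charset, so it can be trusted.
declared = explicit_charset('text/html; charset=UTF-8', 'UTF-8')

# No charset in the header: requests would have fallen back to ISO-8859-1,
# so return None and let the parser detect the encoding itself.
undeclared = explicit_charset('text/html', 'ISO-8859-1')

# A missing header is treated the same way.
missing = explicit_charset(None, None)
```

With a real response this could be called as explicit_charset(res.headers.get('content-type'), res.encoding), with the result passed to BeautifulSoup as from_encoding.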

'charmap' codec can't encode character error in Python while parsing HTML