Question

The encoding reported by the requests module differs from the actual encoding set in the HTML page

Code:

import requests
URL = "http://www.reynamining.com/nuevositio/contacto.html"
obj = requests.get(URL, timeout=60, verify=False, allow_redirects=True)
print(obj.encoding)

Output:

ISO-8859-1

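The actual encoding set in the HTML is UTF-8: content="text/html; charset=UTF-8"

My questions are:

Why does requests' encoding attribute show a different encoding from the one declared in the HTML page?

I am trying to convert the content to UTF-8 with objReq.content.decode(encodes).encode("utf-8"). Because the page is already UTF-8, decoding it as ISO-8859-1 and then encoding it as UTF-8 changes the values, e.g. á turns into Ã.

Is there any way to convert all types of encodings to UTF-8?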

Answer 1

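Requests sets the response.encoding attribute to ISO-8859-1 when the response has a text/* content type and no charset has been specified in the response headers.

See the Encoding section of the Advanced documentation:

The only time Requests will not do this is if no explicit charset is present in the HTTP headers and the Content-Type header contains text. In this situation, RFC 2616 specifies that the default charset must be ISO-8859-1. Requests follows the specification in this case. If you require a different encoding, you can manually set the Response.encoding property, or use the raw Response.content.

Emphasis mine: the key condition is that no explicit charset is present and the Content-Type contains text.

You can test for this by looking for a charset parameter in the Content-Type header: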

resp = requests.get(....)
encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
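If you want the charset value itself rather than a substring test, here is a minimal sketch using only the standard library. The helper name charset_from_content_type is made up for illustration, and it takes the raw header string, so no request is needed to try it:

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_param() reads parameters of the Content-Type header by default
    return msg.get_param("charset")


print(charset_from_content_type("text/html; charset=UTF-8"))  # UTF-8
print(charset_from_content_type("text/html"))                 # None
```

With a real response you would pass resp.headers.get('content-type', '') to the helper. A None result corresponds to the case where requests falls back to ISO-8859-1.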

Your HTML document specifies the content type in a <meta> header, and it is this header that is authoritative:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

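HTML 5 also defines a <meta charset="..." /> tag, see <meta charset="utf-8"> vs <meta http-equiv="Content-Type">

You should not re-encode an HTML page to UTF-8 if it contains such a header naming a different codec. In that case you must at the very least correct that header.

Using BeautifulSoup: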
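To see which encoding a page declares for itself, a rough sketch like the following can scan the raw bytes for either form of the tag. The function name declared_html_charset and the regular expression are my own simplification, not a full HTML parser; the 1024-byte limit follows the HTML 5 rule that the declaration must appear within the first 1024 bytes:

```python
import re

# Matches both <meta charset="..."> and the http-equiv form with "charset=..."
META_CHARSET_RE = re.compile(
    br'<meta[^>]+charset\s*=\s*["\']?\s*([\w-]+)', re.IGNORECASE
)


def declared_html_charset(raw_bytes, scan_limit=1024):
    """Return the lowercased charset named in a <meta> tag, or None."""
    match = META_CHARSET_RE.search(raw_bytes[:scan_limit])
    return match.group(1).decode("ascii").lower() if match else None


old_style = b'<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />'
new_style = b'<meta charset="utf-8">'
print(declared_html_charset(old_style))  # utf-8
print(declared_html_charset(new_style))  # utf-8
```

With requests you would pass resp.content, which holds the undecoded bytes.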

from bs4 import BeautifulSoup

# resp is the requests response obtained above
# pass in explicit encoding if set as a header
encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
content = resp.content
soup = BeautifulSoup(content, 'html.parser', from_encoding=encoding)
if soup.original_encoding != 'utf-8':
    meta = soup.select_one('meta[charset], meta[http-equiv="Content-Type"]')
    if meta:
        # replace the meta charset info before re-encoding
        if 'charset' in meta.attrs:
            meta['charset'] = 'utf-8'
        else:
            meta['content'] = 'text/html; charset=utf-8'
    # re-encode to UTF-8; passing encoding makes prettify() return bytes
    content = soup.prettify(encoding='utf-8')

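Similarly, other document standards may also specify a particular encoding; XML, for example, is always UTF-8 unless an <?xml encoding="..." ... ?> XML declaration says otherwise, and that declaration is again part of the document.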
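The XML case can be sketched the same way. This is a simplified illustration with a made-up helper name, xml_declared_encoding; it assumes the declaration sits at the very start of the bytes and ignores byte order marks, which a real XML parser would also take into account:

```python
import re

# The XML declaration, if present, must be the first thing in the document
XML_DECL_RE = re.compile(br'^<\?xml[^>]*encoding\s*=\s*["\']([\w.-]+)["\']')


def xml_declared_encoding(raw_bytes):
    """Return the encoding named in the XML declaration, else 'utf-8'."""
    match = XML_DECL_RE.match(raw_bytes)
    return match.group(1).decode("ascii") if match else "utf-8"


print(xml_declared_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'))  # ISO-8859-1
print(xml_declared_encoding(b'<a/>'))                                             # utf-8
```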

Answer 2

Requests will first check for an encoding in the HTTP header:

print(obj.headers['content-type'])

Output:

text/html

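The header carries no charset parameter, so requests cannot determine the encoding from it and falls back to the default, ISO-8859-1.

See more in the docs.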
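The effect of that fallback on a UTF-8 page can be reproduced without any network access. This small sketch shows why á ends up as Ã in the question; the variable names are only illustrative:

```python
original = u"\u00e1"                        # the character á
utf8_bytes = original.encode("utf-8")       # two bytes: 0xC3 0xA1

# Decoding UTF-8 bytes as ISO-8859-1 turns each byte into its own character
garbled = utf8_bytes.decode("iso-8859-1")   # u"\u00c3\u00a1", displayed as Ã¡

# Re-encoding the garbled text as UTF-8 bakes the damage in (now four bytes)
double_encoded = garbled.encode("utf-8")

# Decoding with the correct codec recovers the original character
correct = utf8_bytes.decode("utf-8")

print(garbled == u"\u00c3\u00a1")  # True
print(len(double_encoded))         # 4
print(correct == original)         # True
```

So the bytes in obj.content are fine as they are; only the codec used to decode them is wrong.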

Answer 3

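Requests relies on the HTTP Content-Type response header and chardet. For the common case of text/html, it assumes a default of ISO-8859-1. The issue is that Requests knows nothing about HTML meta tags, which can specify a different text encoding, e.g. <meta charset="utf-8"> or:

<meta http-equiv="content-type" content="text/html; charset=UTF-8">

A good solution is to use BeautifulSoup's "Unicode, Dammit" feature (the UnicodeDammit class in bs4), which works out the encoding from the document bytes themselves, including any encoding declared in the markup.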
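A minimal sketch of the "Unicode, Dammit" approach, assuming the bs4 package is installed. The helper name decode_html and the sample bytes are invented for illustration, so it runs without fetching the page from the question:

```python
from bs4 import UnicodeDammit


def decode_html(raw_bytes):
    """Hypothetical helper: decode HTML bytes with the encoding bs4 detects."""
    # is_html=True lets bs4 look for an encoding declared in a <meta> tag
    dammit = UnicodeDammit(raw_bytes, is_html=True)
    return dammit.unicode_markup, dammit.original_encoding


# Stand-in for resp.content: a UTF-8 page that declares its own charset
sample = b'<html><head><meta charset="utf-8"></head><body>\xc3\xa1</body></html>'
text, detected = decode_html(sample)
print(detected)
print(text)
```

With requests, the same idea would be to pass resp.content to the helper, or to assign the detected name to resp.encoding before reading resp.text.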
