Question

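I'm a little surprised at how complicated it is to get the charset of a web page with Python. Am I missing a way? HTTPMessage has plenty of functions, but not this one.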

>>> google = urllib2.urlopen('http://www.google.com/')
>>> google.headers.gettype()
'text/html'
>>> google.headers.getencoding()
'7bit'
>>> google.headers.getcharset()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: HTTPMessage instance has no attribute 'getcharset'

So you have to fetch the header and split it. Twice.

>>> google = urllib2.urlopen('http://www.google.com/')
>>> charset = 'ISO-8859-1'
>>> contenttype = google.headers.getheader('Content-Type', '')
>>> if ';' in contenttype:
...     charset = contenttype.split(';')[1].split('=')[1]
>>> charset
'ISO-8859-1'

That is a surprising number of steps for such a basic piece of functionality. Am I missing something?

Answer 1

Have you checked this?

How to download any(!) webpage with correct charset in python?

Answer 2

I did some research and came up with this solution:

import urllib.request
response = urllib.request.urlopen(url)
encoding = response.headers.get_content_charset()

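This is how I would do it in Python 3. I have not tested it in Python 2, where the function is urllib2.urlopen (there is no urllib2.request). Note that the Python 2 headers object has no get_content_charset method; there, response.headers.getparam('charset') should give the same information.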

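Here is how it works, since the official Python documentation does not explain it very well: the result of urlopen is an http.client.HTTPResponse object. The headers attribute of this object is an http.client.HTTPMessage object, which, according to the documentation, "is implemented using the email.message.Message class". That class has a method called get_content_charset, which tries to determine and return the charset of the response.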

By default, this method returns None if it cannot determine the charset, but you can override this behaviour by passing a failobj argument:

encoding = response.headers.get_content_charset(failobj="utf-8")
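
To see what get_content_charset does without making a network request, the header parsing can be tried on a hand-built message. This is only a minimal sketch: the helper name charset_from_content_type is made up for illustration, and it assumes Python 3.

```python
from email.message import Message


def charset_from_content_type(content_type, default=None):
    """Return the charset parameter of a Content-Type value, or `default`.

    Hypothetical helper: it builds a bare email.message.Message, which
    provides the same get_content_charset method that response.headers
    exposes in Python 3.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset lowercases the charset name it finds.
    return msg.get_content_charset(failobj=default)


print(charset_from_content_type("text/html; charset=ISO-8859-1"))
print(charset_from_content_type("text/html"))
print(charset_from_content_type("text/html", default="utf-8"))
```

The first call finds the declared charset, the second finds none and returns None, and the third falls back to the value passed as default.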

Answer 3

You're not missing anything. It's doing the right thing: the encoding of an HTTP response is a sub-part of Content-Type.

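Also note that some pages might send only Content-Type: text/html and then set the encoding via <meta http-equiv="Content-Type" content="text/html; charset=utf-8">. That is an ugly hack (on the part of the page author), though, and is not too common.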
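
For pages that declare the encoding only in a meta tag, one rough option is to scan the first few kilobytes of the raw body for a charset declaration. The sketch below is an assumption-laden illustration, not a full HTML parser: the function name charset_from_meta, the regular expression, and the 4096-byte scan limit are all made up for this example.

```python
import re

# Matches both <meta charset="..."> and the http-equiv form, where the
# charset appears inside the content attribute. Deliberately simplistic.
_META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)',
    re.IGNORECASE,
)


def charset_from_meta(html_bytes, scan_limit=4096):
    """Return the charset named in a meta tag near the top of the page, or None."""
    match = _META_CHARSET.search(html_bytes[:scan_limit])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


page = b'<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head></html>'
print(charset_from_meta(page))
```

Working on bytes rather than text is intentional here: the body cannot be decoded reliably until the charset is known.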

Answer 4

I would go with chardet, the Universal Encoding Detector.

>>> import urllib
>>> urlread = lambda url: urllib.urlopen(url).read()
>>> import chardet
>>> chardet.detect(urlread("http://google.cn/"))
{'encoding': 'GB2312', 'confidence': 0.99}

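You're doing it right, but your approach will fail for pages where the charset is declared in a meta tag or is not declared at all.
If you look more closely at the chardet sources, it has charsetprober/charsetgroupprober modules that deal with this problem nicely.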
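
One way to combine the suggestions in these answers is to collect candidate names in order of trust (the Content-Type header, then a meta tag, then a detector's guess such as the 'encoding' value from chardet.detect) and keep the first one Python actually recognises. The helper below is a hypothetical sketch of that idea; its name and the utf-8 default are assumptions, and it leaves gathering the candidates to the caller.

```python
import codecs


def first_usable_charset(candidates, default="utf-8"):
    """Return the normalised name of the first candidate Python can decode with.

    `candidates` is an iterable of charset names (or None) ordered from most
    to least trusted. Unknown or missing names are skipped.
    """
    for name in candidates:
        if not name:
            continue
        try:
            # codecs.lookup raises LookupError for names Python does not know.
            return codecs.lookup(name).name
        except LookupError:
            continue
    return default


# Header gave nothing, the meta tag named a bogus charset, the detector said GB2312.
print(first_usable_charset([None, "no-such-charset", "GB2312"]))
```

Validating with codecs.lookup guards against pages that declare a charset name Python cannot decode, which would otherwise only surface later as an error in bytes.decode.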

What is a good, reliable, and short way to get the charset of a web page?
