Question

I would expect the output of getencoding in the following Python session to be "ISO-8859-1":

>>> import urllib2
>>> response = urllib2.urlopen("http://www.google.com/")
>>> response.info().plist
['charset=ISO-8859-1']
>>> response.info().getencoding()
'7bit'

This is Python version 2.6 ('2.6 (r26:66714, Aug 17 2009, 16:01:07) \n[GCC 4.0.1 (Apple Inc. build 5484)]' to be specific).

Answer 1

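So, what makes you think it is broken?

I get ISO-8859-2 with urllib and wget (I'm currently in Poland). I get UTF-8 with Firefox. This is because my Firefox tells the site that it accepts ISO-8859-1 and UTF-8, while wget and urllib2 don't say anything. The relevant request header is: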

Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7

Remove UTF-8 from that header and you will no longer get UTF-8 back, which is easy to test by telnetting to port 80.
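The same experiment can be set up from Python instead of telnet. What follows is only a minimal sketch: it builds a request that carries the Accept-Charset value quoted above and inspects it locally, without contacting the server. The `ACCEPT_CHARSET` constant is a name chosen for this illustration, and the import fallback is there so the snippet also runs on Python 3, where urllib2 was folded into urllib.request.

```python
try:
    from urllib2 import Request  # Python 2, as used in the question
except ImportError:
    from urllib.request import Request  # Python 3 equivalent

# Header value copied from the Firefox request quoted above.
ACCEPT_CHARSET = "ISO-8859-1,utf-8;q=0.7,*;q=0.7"

# Build the request object only; nothing is sent over the network here.
req = Request("http://www.google.com/",
              headers={"Accept-Charset": ACCEPT_CHARSET})

# Request stores header names via str.capitalize(), so the key is
# looked up as 'Accept-charset'.
print(req.get_header("Accept-charset"))
```

Passing `req` to `urlopen` would send the header. Dropping `utf-8` from the value and comparing the `charset` parameter of the two responses is one way to repeat the test described above.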

Google.com simply (and reasonably) defaults to ISO-8859-1, google.pl defaults to ISO-8859-2, and I'm sure other sites have other defaults.

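I get no encoding header with wget, urllib2, or telnet. I guess urllib2 then assumes 7bit, which may not be the most sensible default, since Content-Encoding is usually either gzip or absent.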

Answer 2

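According to the documentation:

Message.getencoding()

Return the encoding specified in the Content-Transfer-Encoding message header. If no such header exists, return '7bit'. The encoding is converted to lower case.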
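In other words, the character set is a parameter of the Content-Type header, while getencoding() looks at a different header altogether. The sketch below illustrates the distinction with the standard library's email package on a hand-written header block. The `RAW_HEADERS` value is an assumption modelled on the response in the question, not captured server output. On the object returned by `response.info()` in Python 2, `getparam('charset')` should be the analogous accessor.

```python
from email import message_from_string

# Hypothetical header block, modelled on the response in the question.
RAW_HEADERS = (
    "Content-Type: text/html; charset=ISO-8859-1\n"
    "\n"
)

msg = message_from_string(RAW_HEADERS)

# The charset is a parameter of the Content-Type header.
charset = msg.get_param("charset")

# The transfer encoding lives in its own header. It is absent here, so
# fall back to '7bit', mirroring the documented getencoding() behaviour.
transfer_encoding = msg.get("Content-Transfer-Encoding", "7bit").lower()

print(charset)
print(transfer_encoding)
```

With these headers, `charset` holds the value the question expected, and `transfer_encoding` falls back to '7bit' only because no Content-Transfer-Encoding header is present.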

Is the implementation of response.info().getencoding() broken in urllib2?
