Question

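I am building a system in which all URLs, HTML, text, links, etc. are stored as unicode. To do this, I fetch the HTML from a web page and convert it to unicode using the code pasted here. Some of the links I tried work fine. Others, like the link in my source code below, raise an error. How can I fix this?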

import urllib2
from cookielib import CookieJar
cj = CookieJar()
url = 'http://www.economist.com/news/leaders/21596515-there-are-lessons-many-governments-one-countrys-100-years-decline-parable'
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11 Chrome/32.0.1700.77 Safari/537.36'), ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'), ('Accept-Charset', 'ISO-8859-1,utf-8;q=0.7,*;q=0.3'), ('Accept-Encoding','gzip,deflate,sdch'), ('Connection', 'keep-alive')]
resp = opener.open(url, timeout=5)
raw_html = resp.read()
raw_html.decode('utf-8')

This gives the error:

UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 1: invalid start byte

Answer 1

The response data is GZip-compressed.

You can try decompressing it:

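from gzip import GzipFile
from StringIO import StringIO
try:
    raw_html = GzipFile(fileobj=StringIO(raw_html)).read()
except IOError:
    # Body was not gzip data; keep raw_html as-is.
    pass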
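
For context, the byte 0x8b at position 1 in the error message is the second byte of the gzip magic number (0x1f 0x8b), which is why the body fails to decode as UTF-8. Below is a minimal sketch in Python 3 (the question uses Python 2) that checks for that signature before decompressing. `maybe_gunzip` is a hypothetical helper name, not part of any library, and the sample data is built locally rather than fetched:

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def maybe_gunzip(raw):
    """Return raw decompressed if it starts with the gzip signature, else unchanged."""
    if raw[:2] == GZIP_MAGIC:
        return gzip.decompress(raw)
    return raw


# Locally built stand-in for a compressed HTTP response body.
body = "<html>caf\u00e9</html>".encode("utf-8")
compressed = gzip.compress(body)

# Decompress first, then decode to a unicode string.
text = maybe_gunzip(compressed).decode("utf-8")
```

Checking the signature avoids a catch-all `except`, so unrelated errors are not silently swallowed.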

Alternatively, you can send the header Accept-Encoding: deflate (without 'gzip'):
```
opener.addheaders = [('Accept-Encoding', 'deflate'), ]
```
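
One caveat: a server that honors `Accept-Encoding: deflate` may still send a deflate-compressed body, so decompression can be needed with this header too. A rough Python 3 sketch of handling both cases follows, assuming the value of the response's Content-Encoding header is passed in by the caller. `decode_body` is a hypothetical helper, not a library function:

```python
import gzip
import zlib


def decode_body(raw, content_encoding):
    """Undo gzip or deflate compression based on the Content-Encoding value."""
    encoding = (content_encoding or "").strip().lower()
    if encoding == "gzip":
        return gzip.decompress(raw)
    if encoding == "deflate":
        try:
            # Standard case: zlib-wrapped deflate data.
            return zlib.decompress(raw)
        except zlib.error:
            # Some servers send raw deflate without the zlib header.
            return zlib.decompress(raw, -zlib.MAX_WBITS)
    # No compression, or an encoding this sketch does not handle.
    return raw
```

With `urllib.request` in Python 3, the header value would typically come from `resp.headers.get("Content-Encoding")`; the decoded bytes can then be passed to `.decode('utf-8')` as in the question.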
