Why can't I decode the zhihu.com response with urllib.request.urlopen(url).read()?

Asked: 2015-07-30 17:35:26

Tags: python unicode urllib python-3.4

I found the following example in the Python urllib.request documentation:

from urllib.request import urlopen
with urlopen('http://tycho.usno.navy.mil/cgi-bin/timer.pl') as response:
    for line in response:
        line = line.decode('utf-8')
        if 'EST' in line or 'EDT' in line:
            print(line)

This outputs:

Nov. 25, 09:43:32 PM EST

I tried to adapt that code for a Chinese website:

import urllib.request

url = 'http://www.zhihu.com'
response = urllib.request.urlopen(url).read().decode("utf-8")
print(response) 

But I got this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

How can I fix this?

1 Answer:

Answer 0: (score: 2)

The site returns a gzip-compressed response even though you didn't request one:

>>> from urllib.request import urlopen
>>> url = 'http://www.zhihu.com'
>>> response = urlopen(url)
>>> response.info().get('Content-Encoding')
'gzip'

This is in violation of the HTTP RFC, and the site does it even when you explicitly forbid gzip:

>>> from urllib.request import Request
>>> response = urlopen(Request(url, headers={'Accept-Encoding': 'identity,gzip;q=0'}))
>>> response.info().get('Content-Encoding')
'gzip'
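
The error message in the question already points at this: a gzip stream starts with the two bytes 0x1f 0x8b, and 0x8b at position 1 is exactly the byte the UTF-8 codec rejects. A minimal sketch of checking for that signature, which may help when the Content-Encoding header is missing or can't be trusted. The name looks_gzipped is a hypothetical helper, not part of urllib, and the sample bytes are compressed locally rather than fetched from the site:

```python
import gzip

GZIP_MAGIC = b'\x1f\x8b'  # first two bytes of every gzip stream (RFC 1952)

def looks_gzipped(data):
    """Hypothetical helper: True if the bytes start with the gzip signature."""
    return data[:2] == GZIP_MAGIC

# Locally compressed bytes stand in for response.read()
sample = gzip.compress('知乎'.encode('utf-8'))
print(looks_gzipped(sample))              # True
print(looks_gzipped(b'<!DOCTYPE html>'))  # False
```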

You have to decompress the response data first; only then can you decode the resulting bytes as UTF-8:

>>> import zlib
>>> decompressed_data = zlib.decompress(response.read(), 16+zlib.MAX_WBITS)
>>> print(*decompressed_data.decode('utf8').splitlines(True)[:10])
<!DOCTYPE html>
 <html lang="zh-CN">
 <head>
 <meta charset="utf-8">
 <meta name="apple-itunes-app" content="app-id=432274380">
 <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />
 <meta name="renderer" content="webkit" />
 <meta name="description" content="一个真实的网络问答社区,帮助你寻找答案,分享知识。"/>
 <meta name="viewport" content="user-scalable=no, width=device-width, initial-scale=1.0, maximum-scale=1.0"/>
 <title>知乎 - 与世界分享你的知识、经验和见解</title>
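
The decompress-then-decode step can be wrapped in a small function so the same code works whether or not a server compresses its reply. This is only a sketch under some assumptions: decode_body is a hypothetical name, the default charset of utf-8 is a guess that fits this site, and only the gzip and zlib-wrapped deflate codings are handled. It uses gzip.decompress from the standard library, which is equivalent to the zlib call above for gzip data:

```python
import gzip
import zlib

def decode_body(raw, content_encoding=None, charset='utf-8'):
    """Hypothetical helper: undo the declared content coding, then decode to str."""
    encoding = (content_encoding or '').strip().lower()
    if encoding in ('gzip', 'x-gzip'):
        raw = gzip.decompress(raw)
    elif encoding == 'deflate':
        # Assumes the zlib-wrapped form; raw deflate streams would need wbits=-15
        raw = zlib.decompress(raw)
    return raw.decode(charset)

# Offline demonstration: locally compressed bytes stand in for response.read()
body = gzip.compress('<title>知乎</title>'.encode('utf-8'))
print(decode_body(body, 'gzip'))   # <title>知乎</title>
print(decode_body(b'plain text'))  # plain text
```

With a live response, the arguments would presumably come from response.read(), response.info().get('Content-Encoding'), and response.info().get_content_charset(), falling back to 'utf-8' when the server declares no charset.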