Question

I found the following example in the Python urllib.request documentation:

from urllib.request import urlopen
with urlopen('http://tycho.usno.navy.mil/cgi-bin/timer.pl') as response:
    for line in response:
        line = line.decode('utf-8')
        if 'EST' in line or 'EDT' in line:
            print(line)

This outputs:

Nov. 25, 09:43:32 PM EST

I tried to adapt that code for a Chinese website:

import urllib.request

url = 'http://www.zhihu.com'
response = urllib.request.urlopen(url).read().decode("utf-8")
print(response)

But I got this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte.

How can I fix this?

Answer 1

The server returns a gzip-compressed response even though you did not request one:

>>> from urllib.request import urlopen
>>> url = 'http://www.zhihu.com'
>>> response = urlopen(url)
>>> response.info().get('Content-Encoding')
'gzip'

This violates the HTTP RFC, and the server does it even when you explicitly forbid gzip:

>>> from urllib.request import Request
>>> response = urlopen(Request(url, headers={'Accept-Encoding': 'identity,gzip;q=0'}))
>>> response.info().get('Content-Encoding')
'gzip'
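
The byte 0x8b at position 1 in the error message is consistent with this: a gzip stream starts with the two magic bytes 0x1f 0x8b, and 0x8b is not a valid UTF-8 start byte. A minimal offline sketch of that failure, assuming a small made-up payload in place of the real page:

```python
import gzip

# Hypothetical stand-in for the page body; the real response is not fetched here.
compressed = gzip.compress("<title>知乎</title>".encode("utf-8"))

# Every gzip stream begins with the magic bytes 0x1f 0x8b.
magic = compressed[:2]

error_position = None
failing_byte = None
try:
    compressed.decode("utf-8")
except UnicodeDecodeError as exc:
    # 0x1f is valid ASCII, so decoding only fails at the second byte (index 1).
    error_position = exc.start
    failing_byte = compressed[exc.start]

print(magic, error_position, hex(failing_byte))
```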

You have to decompress the response data first, and only then decode the resulting bytes as UTF-8:

>>> import zlib
>>> decompressed_data = zlib.decompress(response.read(), 16+zlib.MAX_WBITS)
>>> print(*decompressed_data.decode('utf8').splitlines(True)[:10])
<!DOCTYPE html>
 <html lang="zh-CN">
 <head>
 <meta charset="utf-8">
 <meta name="apple-itunes-app" content="app-id=432274380">
 <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />
 <meta name="renderer" content="webkit" />
 <meta name="description" content="一个真实的网络问答社区，帮助你寻找答案，分享知识。"/>
 <meta name="viewport" content="user-scalable=no, width=device-width, initial-scale=1.0, maximum-scale=1.0"/>
 <title>知乎 - 与世界分享你的知识、经验和见解</title>
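
The same steps can be wrapped in a small helper that decompresses only when needed. This is a sketch, not a definitive implementation: the name decode_body and its parameters are hypothetical, and it uses gzip.decompress from the standard library, which handles the same gzip format as the zlib call above.

```python
import gzip


def decode_body(raw, content_encoding=None, charset="utf-8"):
    """Return the response body as text, undoing gzip compression if present.

    decode_body is a hypothetical helper, not part of urllib.
    """
    encoding = (content_encoding or "").lower()
    # Trust the Content-Encoding header, but also check the gzip magic bytes
    # in case the header is missing (an assumption about misbehaving servers).
    if encoding == "gzip" or raw[:2] == b"\x1f\x8b":
        raw = gzip.decompress(raw)
    return raw.decode(charset)


# Offline demonstration with made-up data instead of a live request.
sample = "知乎 - example"
print(decode_body(gzip.compress(sample.encode("utf-8")), "gzip"))
print(decode_body(sample.encode("utf-8")))
```

With a live response this would be called as decode_body(response.read(), response.info().get('Content-Encoding')).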

Why can't I decode the zhihu.com response using urllib.request.urlopen(url).read()?
