I found the following example in the Python urllib.request documentation:
from urllib.request import urlopen
with urlopen('http://tycho.usno.navy.mil/cgi-bin/timer.pl') as response:
    for line in response:
        line = line.decode('utf-8')
        if 'EST' in line or 'EDT' in line:
            print(line)
This prints:
Nov. 25, 09:43:32 PM EST
I tried to adapt this code for a Chinese website:
import urllib.request
url = 'http://www.zhihu.com'
response = urllib.request.urlopen(url).read().decode("utf-8")
print(response)
But I got this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte.
How can I fix this?
Answer 0 (score: 2)
The site returns a gzip-compressed response even though you did not ask for one:
>>> from urllib.request import urlopen
>>> url = 'http://www.zhihu.com'
>>> response = urlopen(url)
>>> response.info().get('Content-Encoding')
'gzip'
This violates the HTTP RFC, and the site does it even when you explicitly forbid gzip:
>>> from urllib.request import Request
>>> response = urlopen(Request(url, headers={'Accept-Encoding': 'identity,gzip;q=0'}))
>>> response.info().get('Content-Encoding')
'gzip'
You have to decompress the response data first, and only then decode the resulting bytes as UTF-8:
>>> import zlib
>>> decompressed_data = zlib.decompress(response.read(), 16+zlib.MAX_WBITS)
>>> print(*decompressed_data.decode('utf8').splitlines(True)[:10])
<!DOCTYPE html>
<html lang="zh-CN">
<head>
<meta charset="utf-8">
<meta name="apple-itunes-app" content="app-id=432274380">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />
<meta name="renderer" content="webkit" />
<meta name="description" content="一个真实的网络问答社区,帮助你寻找答案,分享知识。"/>
<meta name="viewport" content="user-scalable=no, width=device-width, initial-scale=1.0, maximum-scale=1.0"/>
<title>知乎 - 与世界分享你的知识、经验和见解</title>
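If you would rather not hard-code the decompression step, the same idea can be wrapped in a small helper. This is only a sketch: `decode_body` is a hypothetical name, not part of urllib, and falling back to the gzip magic number (0x1f 0x8b, which is where the 0x8b at position 1 in the error comes from) when the header is missing is my assumption about what is reasonable here. The sample at the bottom uses locally compressed data, so it runs without a network connection.

```python
import gzip
import zlib


def decode_body(raw, content_encoding=None, charset="utf-8"):
    """Return text from a raw HTTP body, un-gzipping it first when needed.

    Hypothetical helper for illustration; not part of urllib.
    """
    # Trust the Content-Encoding header when present; otherwise fall back
    # to checking for the gzip magic number at the start of the body.
    is_gzip = (content_encoding or "").lower() == "gzip" or raw[:2] == b"\x1f\x8b"
    if is_gzip:
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        raw = zlib.decompress(raw, 16 + zlib.MAX_WBITS)
    return raw.decode(charset)


# Offline check with a locally compressed payload (no network needed).
sample = "<title>知乎</title>"
compressed = gzip.compress(sample.encode("utf-8"))
print(decode_body(compressed, "gzip") == sample)
print(decode_body(sample.encode("utf-8")) == sample)
```

With a real response you would call it as decode_body(response.read(), response.info().get('Content-Encoding')), so an uncompressed reply still decodes normally.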