Question

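I tried to parse a web page using the urlopen() method from urllib.request, like this:

from urllib.request import Request, urlopen
req = Request(url)
html = urlopen(req).read()

However, the last line returns the result as bytes.

So I tried to decode it, like this:

html = urlopen(req).read().decode("utf-8")

However, an error occurred:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte.

After some research, I found one related answer, which parses charset to decide how to decode. However, the page does not return a charset, and when I checked it in Chrome Web Inspector, the following line was in its header:

<meta charset="utf-8">

So why can't I decode it with utf-8? And how can I parse the web page successfully?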

The site URL is the one passed as url above; I want to save the image from that page to my disk.

Note that I am using Python 3.5.1. I also note that everything I wrote above has worked well in my other scraping programs.

Answer 1

The content is compressed with gzip. You need to decompress it:

import gzip
from urllib.request import Request, urlopen

req = Request(url)
# The body is a gzip stream (it starts with bytes 0x1f 0x8b), so decompress it before decoding.
html = gzip.decompress(urlopen(req).read()).decode('utf-8')
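If it helps to see where the error comes from: every gzip stream starts with the two bytes 0x1f 0x8b, and 0x8b is not a valid first byte of a UTF-8 sequence, which matches "byte 0x8b in position 1". Below is a minimal sketch that reproduces the error locally without any network access and only decompresses when those magic bytes are present. The names GZIP_MAGIC and decode_body are my own, not part of any library, and the sample markup is just a stand-in for the real page.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def decode_body(raw, charset="utf-8"):
    """Decode a response body that may or may not be gzip-compressed.

    Hypothetical helper: it sniffs the gzip magic bytes instead of
    trusting any header, then decodes with the given charset.
    """
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return raw.decode(charset)


# Stand-in for the page: some UTF-8 markup, compressed locally.
sample = '<meta charset="utf-8">'
compressed = gzip.compress(sample.encode("utf-8"))

# Decoding the compressed bytes directly fails the same way as in the question.
try:
    compressed.decode("utf-8")
except UnicodeDecodeError as exc:
    error_position = exc.start  # index of the byte that could not be decoded
    print(exc)

# Decompressing first gives back the original text.
print(decode_body(compressed))
```

Bodies that are not compressed pass through unchanged, so the same helper should be usable on pages that worked before.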

If you use requests, it will decompress it for you automatically:

import requests
html = requests.get(url).text  # => str, not bytes
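If you would rather stay with urllib, something similar can be approximated by looking at the response headers yourself. This is only a sketch under assumptions: read_text and fetch_text are names I made up, only gzip is handled (not deflate or other encodings), and utf-8 is used as a fallback when the Content-Type header carries no charset, which is the situation described in the question.

```python
import gzip
from urllib.request import Request, urlopen


def read_text(response, default_charset="utf-8"):
    """Turn an HTTP response into str, undoing gzip if the server used it.

    Hypothetical helper: `response` is anything with .read() and .headers,
    such as the object returned by urlopen().
    """
    headers = response.headers
    body = response.read()
    # Decompress only when the server says the body is gzip-encoded.
    if (headers.get("Content-Encoding") or "").lower() == "gzip":
        body = gzip.decompress(body)
    # Use the charset from Content-Type if present, else the fallback.
    charset = headers.get_content_charset() or default_charset
    return body.decode(charset)


def fetch_text(url):
    """Fetch `url` and return its body as text (hypothetical wrapper)."""
    req = Request(url, headers={"Accept-Encoding": "gzip"})
    with urlopen(req) as response:
        return read_text(response)
```

Usage would be html = fetch_text(url). Keeping read_text separate from the network call makes it possible to try it against a fake response object without hitting the site.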