Question

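Hi folks,

I'm new to using Python to fetch data from the web. I'd like to get the source code of this page into a string: https://projects.fivethirtyeight.com/2018-nba-predictions/

The following code works for other web pages (for example https://www.basketball-reference.com/boxscores/201712090ATL.html):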

import urllib.request
file = urllib.request.urlopen(webAddress)
data = file.read()
file.close()
dataString = data.decode(encoding='UTF-8')

And I expect dataString to be a string of HTML (see below for what I expect in this particular case):

<!DOCTYPE html><html lang="en"><head><meta property="article:modified_time" etc etc

Instead, for the 538 site, I get this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

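My research suggests the problem is that my file isn't actually encoded as UTF-8, but both the page's charset and beautiful-soup's UnicodeDammit() claim it's UTF-8 (the second possibly because of the first). chardet.detect() doesn't suggest any encoding. I tried substituting the following for 'UTF-8' in the encoding parameter of decode(), to no avail:

ISO-8859-1

Latin-1

Windows-1252

It may also be worth mentioning that the byte array data doesn't look the way I'd expect. Here's data[:10] from a working URL: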

b'\n<!DOCTYPE'

Here's data[:10] from the 538 site:

b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03'

What's going on?

Answer 1

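The server is sending you gzip-compressed data; this is not very common, since urllib doesn't set any accept-encoding value by default, so servers generally play it safe and don't compress the data.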

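Still, the content-encoding field of the response is set, so you can tell that your page is indeed gzip-compressed, and you can decompress it with Python's gzip module before further processing.

import urllib.request
import gzip
file = urllib.request.urlopen(webAddress)
data = file.read()
# The header may be absent, so fall back to an empty string.
if file.headers.get('content-encoding', '').lower() == 'gzip':
    data = gzip.decompress(data)
file.close()
dataString = data.decode(encoding='UTF-8')

OTOH, if you can use the requests module, it will handle all this mess by itself, including compression (did I mention that besides gzip you may also get deflate, which is the same but with different headers?) and (at least partially) encoding.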

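A plain requests.get(webAddress) performs your request, and the response's text attribute holds the already-decoded Unicode string.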
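For the urllib route, the gzip and deflate cases mentioned in this answer could be folded into one helper. This is only a sketch: decompress_body is a hypothetical name, not part of any library, and the fallback to a raw deflate stream assumes the server omitted the zlib wrapper, which some servers are known to do.

```python
import gzip
import zlib
from typing import Optional


def decompress_body(data: bytes, content_encoding: Optional[str]) -> bytes:
    """Undo gzip or deflate compression based on a Content-Encoding value.

    Hypothetical helper for illustration; not a library function.
    """
    encoding = (content_encoding or "").strip().lower()
    if encoding == "gzip":
        return gzip.decompress(data)
    if encoding == "deflate":
        try:
            # Usual case: a zlib-wrapped deflate stream.
            return zlib.decompress(data)
        except zlib.error:
            # Assumed fallback: a raw deflate stream without the zlib header.
            return zlib.decompress(data, -zlib.MAX_WBITS)
    # Missing or unrecognized encoding: return the bytes unchanged.
    return data
```

With urllib, this could be called as decompress_body(file.read(), file.headers.get('Content-Encoding')) before decoding the result as UTF-8.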

Answer 2

You are reading gzip-compressed data: http://www.forensicswiki.org/wiki/Gzip You have to decompress it.
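The bytes b'\x1f\x8b' at the start of the question's data are the gzip magic number, which is also why decoding fails at position 1. The offline sketch below reproduces that; the html bytes are a stand-in rather than the real 538 page, and looks_gzipped is a hypothetical helper name.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def looks_gzipped(data: bytes) -> bool:
    """Return True if data starts with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


# Stand-in for a page body; not the real 538 page.
html = b"<!DOCTYPE html><html lang=\"en\"></html>"
compressed = gzip.compress(html)

try:
    compressed.decode("utf-8")
    decode_error = None
except UnicodeDecodeError as exc:
    # 0x1f is valid ASCII, but 0x8b cannot start a UTF-8 sequence,
    # so the decoder gives up at position 1.
    decode_error = exc

# Decompress only when the magic number is present.
restored = gzip.decompress(compressed) if looks_gzipped(compressed) else compressed
```

Checking the magic number is a fallback for when the response headers aren't available; when they are, the content-encoding header is the more direct signal.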

Why can't I decode this UTF-8 page?
