Question

I use the following Python code to download web pages from servers that use gzip compression:

url = "http://www.v-gn.de/wbb/"
import urllib2
request = urllib2.Request(url)
request.add_header('Accept-encoding', 'gzip')
response = urllib2.urlopen(request)
content = response.read()
response.close()

import gzip
from StringIO import StringIO
html = gzip.GzipFile(fileobj=StringIO(content)).read()

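This usually works, but for the URL above it fails with a struct.error exception. I get a similar result if I use wget with an "Accept-encoding" header. Browsers, however, seem to be able to decompress the response.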

So my question is: is there a way to get my Python code to decompress the HTTP response without disabling compression by removing the "Accept-encoding" header?

For completeness, here is the command line I used for wget:

wget --user-agent="Mozilla" --header="Accept-Encoding: gzip,deflate" http://www.v-gn.de/wbb/

Answer 1

It seems you can call readline() on the gzip.GzipFile object, but read() raises struct.error because the file ends abruptly.

Since readline() works (except at the very end), you could do something like this:

import urllib2
import StringIO
import gzip
import struct

url = "http://www.v-gn.de/wbb/"
request = urllib2.Request(url)
request.add_header('Accept-encoding', 'gzip')
response = urllib2.urlopen(request)
content = response.read()
response.close()
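fh = StringIO.StringIO(content)
html = gzip.GzipFile(fileobj=fh)
try:
    # Iterating calls readline(), which succeeds until the stream runs out.
    for line in html:
        line = line.rstrip()
        print(line)
except struct.error:
    # The gzip data ends prematurely; keep what was printed so far.
    pass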
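
The loop above recovers the page line by line and relies on catching the exception at the end. A different technique, sketched here for Python 3 rather than the Python 2 used in this thread, is to bypass gzip.GzipFile and feed the bytes to zlib.decompressobj, which returns whatever it could decode even if the stream stops early. The helper name gunzip_lenient and the sample page are made up for illustration, and the truncation is simulated by dropping the 8-byte gzip trailer; the real server may cut the stream differently.

```python
import gzip
import zlib


def gunzip_lenient(data):
    """Decompress gzip bytes, returning everything that could be decoded.

    Unlike gzip.GzipFile.read(), this does not raise when the input
    stops before the gzip trailer (CRC32 and length) has been seen.
    """
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    return decompressor.decompress(data)


# Illustration only: simulate a response whose 8-byte trailer is missing.
page = b"<html><body>hello</body></html>\n" * 50
broken = gzip.compress(page)[:-8]
print(gunzip_lenient(broken) == page)
```

The trade-off is that the CRC check is skipped for truncated input, so corruption in the payload would go unnoticed.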

Answer 2

I ran the command you specified. It downloaded the gzip-ed data to index.html. I renamed index.html to index.html.gz. Trying gzip -d index.html.gz resulted in an error: gzip: index.html.gz: unexpected end of file.

My second attempt, zcat index.html.gz, worked fine, except that after the </html> tag it printed the same error as above.

$ zcat index.html.gz
...
  </td>
 </tr>
</table>


</body>
</html>
gzip: index.html.gz: unexpected end of file

The server is at fault: it sends a gzip stream that ends prematurely.
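
The same diagnosis can be made from Python instead of the shell. The following is a minimal Python 3 sketch, not something taken from this thread: inspect_gzip is a hypothetical helper that reports how much payload could be decoded and whether zlib reached the end of the stream, trailer included. The test data is synthetic, since which bytes the server actually drops is not known here.

```python
import gzip
import zlib


def inspect_gzip(data):
    """Return (payload, complete) for a gzip byte string.

    complete is True only if zlib saw the whole stream, including the
    gzip trailer; for a truncated stream it stays False.
    """
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    payload = decompressor.decompress(data)
    return payload, decompressor.eof


# Illustration only: compare a complete stream with one missing its trailer.
page = b"</body>\n</html>\n" * 100
good = gzip.compress(page)
for label, blob in (("complete", good), ("no trailer", good[:-8])):
    payload, complete = inspect_gzip(blob)
    print(label, len(payload), complete)
```

A full payload with complete set to False would match what zcat shows above: all of the HTML arrives, but the end of the gzip stream does not.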

Answer 3

Create a handler by deriving from urllib2.HTTPHandler and overriding http_open().

import gzip
from StringIO import StringIO
import httplib, urllib, urllib2
class GzipHandler(urllib2.HTTPHandler):
    def http_open(self, req):
        req.add_header('Accept-encoding', 'gzip')
        r = self.do_open(httplib.HTTPConnection, req)
        if (
            'Content-Encoding' in r.headers and
            r.headers['Content-Encoding'] == 'gzip'
        ):
            fp = gzip.GzipFile(fileobj=StringIO(r.read()))
        else:
            fp = r
        response = urllib.addinfourl(fp, r.headers, r.url, r.code)
        response.msg = r.msg
        return response

Then build your opener.

def retrieve(url):
    request = urllib2.Request(url)
    opener = urllib2.build_opener(GzipHandler)
    return opener.open(request)

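The difference is that this approach checks whether the server actually returned a gzip response, and the check happens as part of the request.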
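
For readers on Python 3, where urllib2 and httplib no longer exist under those names, the same idea might look roughly like the sketch below. It is an adaptation, not the code from this answer: instead of overriding http_open(), it uses the http_request and http_response processor hooks of urllib.request.BaseHandler. The class name GzipProcessor is an assumption chosen for illustration.

```python
import gzip
import io
import urllib.request
import urllib.response


class GzipProcessor(urllib.request.BaseHandler):
    """Ask for gzip and transparently decompress gzip responses."""

    def http_request(self, req):
        # Advertise gzip support on every outgoing request.
        req.add_header("Accept-Encoding", "gzip")
        return req

    def http_response(self, req, resp):
        # Only decompress when the server says the body is gzip-encoded.
        if resp.headers.get("Content-Encoding", "").lower() != "gzip":
            return resp
        body = gzip.GzipFile(fileobj=io.BytesIO(resp.read()))
        wrapped = urllib.response.addinfourl(
            body, resp.headers, resp.url, resp.code
        )
        wrapped.msg = getattr(resp, "msg", "")
        return wrapped

    # Apply the same hooks to HTTPS traffic.
    https_request = http_request
    https_response = http_response


def retrieve(url):
    # Build an opener that routes requests and responses through the hooks.
    opener = urllib.request.build_opener(GzipProcessor)
    return opener.open(url)
```

Like the handler above, this sketch still reads the body through gzip.GzipFile, so a truncated stream such as the one in the question would still raise when the body is read.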

For more information, see:

What is wrong with this gzip format?
