Question

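I am scraping a site that contains many URLs I need to fetch data from. I used XPath to collect all the hrefs (URLs) into a list, then looped over the list and yielded a request for each one. Below is my spider code: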

# Legacy Scrapy (0.x) import paths, matching the BaseSpider / HtmlXPathSelector API used here.
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider


class ExampledotcomSpider(BaseSpider):
    name = "exampledotcom"
    allowed_domains = ["www.example.com"]
    start_urls = ["http://www.example.com/movies/city.html"]

    def parse(self, response):
        # Collect every cinema link on the city page and request each one.
        hxs = HtmlXPathSelector(response)
        cinema_links = hxs.select('//div[@class="contentArea"]/div[@class="leftNav"]/div[@class="cinema"]/div[@class="rc"]/div[@class="il"]/span[@class="bt"]/a/@href').extract()
        for cinema_hall in cinema_links:
            yield Request(cinema_hall, callback=self.parse_cinema)

    def parse_cinema(self, response):
        # Extract the details of a single cinema hall page.
        hxs = HtmlXPathSelector(response)
        cinemahall_name = hxs.select('//div[@class="companyDetails"]/div[@itemscope=""]/span[@class="srchrslt"]/h1/span/text()').extract()
        ........

For example, here I have 60 URLs in the list, and about 37 of them are not downloaded. For those URLs the following error messages appear:

2012-06-06 14:00:12+0530 [exampledotcom] ERROR: Error downloading <GET http://www.example.com/city/Cinema-Hall-70mm-%3Cnear%3E-place/040PXX40-XX40-000147377847-A6M3>: Error -3 while decompressing: invalid stored block lengths
2012-06-06 14:00:12+0530 [exampledotcom] ERROR: Error downloading <GET http://www.example.com/city/Cinema-Hall-35mm-%3Cnear%3E-place/040PXX40-XX40-000164969686-H9C5>: Error -3 while decompressing: invalid stored block lengths

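Scrapy downloads only some of the URLs. For the rest, I don't understand what is happening or what is wrong with my code.

Can anyone suggest how to get rid of these errors?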

Answer 1

I don't think there is anything wrong with your code.

Error -3 while decompressing: invalid stored block lengths
CRC check failed 0x471e6e9a != 0x7c07b839L
Error -3 while decompressing: invalid block type

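All of these errors appear to be related to gzip decompression. Scrapy's HttpCompressionMiddleware advertises gzip and deflate support in the Accept-Encoding request header, so I think the site you are trying to access answers with a compressed body (Content-Encoding: gzip or deflate) that is damaged and cannot be decompressed.

gzip: An encoding format produced by the file compression program "gzip" (GNU zip) as described in RFC 1952 [25]. This format is a Lempel-Ziv coding (LZ77) with a 32 bit CRC.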

See also http://en.wikipedia.org/wiki/HTTP_compression

So I think this is simply a broken web server hosting the pages that Scrapy is trying to download.
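To illustrate why a damaged compressed body surfaces as "Error -3", here is a minimal sketch that uses only Python's standard zlib module, not Scrapy itself. The helper name try_decompress and the sample payload are made up for the illustration, and the corruption is simulated by overwriting one header byte, which is only an assumption about what a misbehaving server might send.

```python
import zlib


def try_decompress(payload):
    """Return the decompressed text, or a short failure message.

    Hypothetical helper for illustration only; not part of Scrapy.
    """
    try:
        return zlib.decompress(payload).decode("utf-8")
    except zlib.error as exc:
        # zlib reports invalid or corrupted input as a data error (code -3).
        return "failed: %s" % exc


# A well-formed compressed body, as a healthy server would send it.
good = zlib.compress(b"<html>cinema listing</html>")

# Simulate a damaged stream by overwriting the first header byte.
bad = b"\x00" + good[1:]

print(try_decompress(good))
print(try_decompress(bad))
```

The second call fails inside the decompressor, which is the same class of failure as the errors in the log: the request itself is fine, but the bytes that came back are not a valid compressed stream.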

Update:

Try deactivating HttpCompressionMiddleware.
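One way to do that is from the project's settings.py. The fragment below is a sketch rather than a verified fix for this particular site: the middleware path shown is the one used by recent Scrapy releases, and I am assuming your version supports the COMPRESSION_ENABLED setting. Use whichever of the two options matches your installation.

```python
# settings.py (sketch; pick the option that matches your Scrapy version)

# Option 1: switch the compression middleware off through its setting.
# Assumption: your Scrapy release supports COMPRESSION_ENABLED.
COMPRESSION_ENABLED = False

# Option 2: remove the middleware from the downloader chain by mapping it to None.
# Recent releases use the path below; old 0.x releases used
# "scrapy.contrib.downloadermiddleware.httpcompression.HttpCompressionMiddleware".
DOWNLOADER_MIDDLEWARES = {
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": None,
}
```

With the middleware disabled, Scrapy no longer asks for compressed responses, so a server that honours the request headers should return plain bodies. If the server compresses regardless, the body will arrive undecoded and the problem is still on the server side.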
