How to get the response body in a Scrapy downloader middleware

Time: 2017-09-19 12:43:33

Tags: python scrapy web-crawler scrapy-spider

I need to be able to retry a request when certain XPaths are not found on the page. So I wrote this middleware:

class ManualRetryMiddleware(RetryMiddleware):
    def process_response(self, request, response, spider):
        if not spider.retry_if_not_found:
            return response
        if not hasattr(response, 'text') and response.status != 200:
            return super(ManualRetryMiddleware, self).process_response(request, response, spider)
        found = False
        for xpath in spider.retry_if_not_found:
            if response.xpath(xpath).extract():
                found = True
                break
        if not found:
            return self._retry(request, "Didn't find anything useful", spider)
        return response

and registered it in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.ManualRetryMiddleware': 650,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
}

When I run the spider, I get:

AttributeError: 'Response' object has no attribute 'xpath'

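I tried creating a selector manually and running the XPath on it, but the response has no text attribute, and response.body is bytes, not str.

So how can I inspect the page content in a middleware? Some pages may not contain the details I need, so I want to be able to try them again later.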

2 answers:

Answer 0: (score: 1)

The reason response has no xpath method is that the response parameter passed to a downloader middleware's process_response method is of type scrapy.http.Response (see the documentation). Only scrapy.http.TextResponse and its subclasses, such as scrapy.http.HtmlResponse, have an xpath method. So before calling xpath, create an HtmlResponse object from response. The corresponding part of your class would become:

...
new_response = scrapy.http.HtmlResponse(response.url, body=response.body)
if new_response.xpath(xpath).extract():
    found = True
    break
...

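Answer 1: (score: 1)

Also pay attention to your middleware's position. It needs to come before scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware in the ordering, that is, it needs a lower order number, because process_response methods are called in decreasing order. Otherwise you may end up trying to parse compressed data, which does not work. Check response.headers to see whether the response is compressed, for example Content-Encoding: gzip.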
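
As a rough illustration of both points, the sketch below registers the middleware with a lower order number than HttpCompressionMiddleware and adds a small header check. It assumes Scrapy's default order of 590 for HttpCompressionMiddleware; the value 550 and the helper name looks_compressed are illustrative choices, not something taken from the answer.

```python
# settings.py (sketch). Assumes HttpCompressionMiddleware keeps its
# default order of 590, so 550 places this middleware before it.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.ManualRetryMiddleware": 550,
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": None,
}


def looks_compressed(headers) -> bool:
    """Hypothetical helper: True if Content-Encoding names a compression."""
    # Scrapy header values are bytes; accept str too for convenience.
    encoding = headers.get("Content-Encoding") or b""
    if isinstance(encoding, bytes):
        encoding = encoding.decode("ascii", "replace")
    return encoding.strip().lower() in {"gzip", "x-gzip", "deflate", "br"}
```

Inside process_response, calling looks_compressed(response.headers) and logging the result is a quick way to confirm whether the body has already been decompressed at that point.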