Question

I need to be able to retry a request if certain XPaths are not found on the page. So I wrote this middleware:

class ManualRetryMiddleware(RetryMiddleware):
    def process_response(self, request, response, spider):
        if not spider.retry_if_not_found:
            return response
        if not hasattr(response, 'text') and response.status != 200:
            return super(ManualRetryMiddleware, self).process_response(request, response, spider)
        found = False
        for xpath in spider.retry_if_not_found:
            if response.xpath(xpath).extract():
                found = True
                break
        if not found:
            return self._retry(request, "Didn't find anything useful", spider)
        return response

and registered it in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.ManualRetryMiddleware': 650,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
}

When I run the spider, I get:

AttributeError: 'Response' object has no attribute 'xpath'

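I tried creating a Selector manually and running the XPath on it, but the response has no text attribute, and response.body is bytes, not str.

So how can I inspect the page content in a middleware? Some pages may not contain the details I need, so I want to be able to try them again later.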

Answer 1

The reason response has no xpath method is that the response parameter of a downloader middleware's process_response method is of type scrapy.http.Response (see the documentation). Only scrapy.http.TextResponse (and subclasses such as scrapy.http.HtmlResponse) has the xpath method. So before using xpath, create an HtmlResponse object from response. The corresponding part of your class would become:

...
new_response = scrapy.http.HtmlResponse(response.url, body=response.body)
if new_response.xpath(xpath).extract():
    found = True
    break
...

Answer 2

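Also pay attention to the position of your middleware. It needs to come before scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware; otherwise you may end up trying to decode compressed data (which indeed does not work). Check response.headers to see whether the response is compressed, for example Content-Encoding: gzip.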
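
One way to apply this, as a sketch: Scrapy calls process_response in decreasing order of the middleware numbers, so a middleware only sees decompressed bodies if its number is lower than that of HttpCompressionMiddleware. The values below assume Scrapy's default DOWNLOADER_MIDDLEWARES_BASE, where HttpCompressionMiddleware sits at 590 and the built-in RetryMiddleware at 550. It is worth checking the defaults of the installed version before relying on these numbers.

```python
# settings.py (sketch; assumes the default order numbers noted above)
DOWNLOADER_MIDDLEWARES = {
    # Below 590 (HttpCompressionMiddleware), so process_response runs
    # after the body has been decompressed.
    "myproject.middlewares.ManualRetryMiddleware": 550,
    # Disable the built-in retry middleware that the subclass replaces.
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": None,
}
```

With the 650 used in the question, the custom process_response would run before decompression, which matches the symptom described in this answer.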

How to get the response body in a Scrapy downloader middleware