I need to be able to retry a request if certain XPaths are not found on the page. So I wrote this middleware:
    from scrapy.downloadermiddlewares.retry import RetryMiddleware

    class ManualRetryMiddleware(RetryMiddleware):
        def process_response(self, request, response, spider):
            if not spider.retry_if_not_found:
                return response
            if not hasattr(response, 'text') and response.status != 200:
                return super(ManualRetryMiddleware, self).process_response(request, response, spider)
            found = False
            for xpath in spider.retry_if_not_found:
                if response.xpath(xpath).extract():
                    found = True
                    break
            if not found:
                return self._retry(request, "Didn't find anything useful", spider)
            return response
and registered it in settings.py:
    DOWNLOADER_MIDDLEWARES = {
        'myproject.middlewares.ManualRetryMiddleware': 650,
        'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    }
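When I run the spider, I get
    AttributeError: 'Response' object has no attribute 'xpath'
I tried creating a Selector manually and running the XPath on it, but the response has no text attribute, and response.body is bytes, not str.
So how can I check the page content in a middleware? Some pages may not contain the details I need, so I want to be able to retry them later.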
Answer 0 (score: 1)
The reason response has no xpath method is that the response argument passed to a downloader middleware's process_response method is of type scrapy.http.Response; see the documentation. Only scrapy.http.TextResponse (and its subclasses, such as scrapy.http.HtmlResponse) have an xpath method. So before calling xpath, create an HtmlResponse object from response. The corresponding part of your class would become:
    ...
    # Build a text-aware response once; the plain Response has no usable xpath().
    new_response = scrapy.http.HtmlResponse(response.url, body=response.body)
    for xpath in spider.retry_if_not_found:
        if new_response.xpath(xpath).extract():
            found = True
            break
    ...
Answer 1 (score: 1)
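Also pay attention to your middleware's position. It needs to come before scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware in the middleware order, that is, have a lower order number. Scrapy calls process_response in decreasing order of those numbers, so a middleware with a higher number sees the response before it has been decompressed, and you may end up trying to parse compressed data (which does not work). Check response.headers to see whether the response is compressed: Content-Encoding: gzip.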
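One way to act on this ordering advice is sketched below. It assumes the default order values from Scrapy's DOWNLOADER_MIDDLEWARES_BASE (550 for the built-in RetryMiddleware, 590 for HttpCompressionMiddleware); please verify them against your installed version. Reusing the slot of the disabled built-in retry middleware is just one convenient choice.

```python
# settings.py (sketch; order values assume Scrapy's documented defaults)
DOWNLOADER_MIDDLEWARES = {
    # 550 is below HttpCompressionMiddleware's default 590, so this
    # middleware's process_response runs after decompression.
    "myproject.middlewares.ManualRetryMiddleware": 550,
    # Disable the built-in retry middleware that the custom class replaces.
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": None,
}
```

With the value 650 from the question, the custom process_response would run before HttpCompressionMiddleware had a chance to decompress the body.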