I have this myself:
import logging

from scrapy.downloadermiddlewares.retry import RetryMiddleware

logger = logging.getLogger(__name__)


class Retry(RetryMiddleware):

    def process_response(self, request, response, spider):
        # response.status is an int, so compare against 503, not the string '503'
        if response.status == 503:
            logger.error("503 status returned: " + response.url)
            return self._retry(request, response, spider) or response
        logger.debug("response.status = " + str(response.status) + " from URL " + str(response.url))
        logger.debug(response.headers)
        return super(Retry, self).process_response(request, response, spider)

    def _retry(self, request, response, spider):
        logger.debug("Deleting session " + str(request.meta['sessionId']))
        self.delete_session(request.meta['sessionId'])
        logger.debug("Retrying URL: %(request)s", {'request': request})
        logger.debug("Request headers were:")
        logger.debug(request.headers)
        retryreq = request.copy()
        # crawlera_auth and delete_session() are defined elsewhere in the project
        retryreq.headers['Authorization'] = crawlera_auth.strip()
        retryreq.headers['X-Crawlera-Session'] = 'create'
        retryreq.dont_filter = True
        return retryreq
In settings.py I have this:
DOWNLOADER_MIDDLEWARES = {
    'craigslist_tickets.retrymiddleware.Retry': 100,
    'craigslist_tickets.crawlera_proxy_middleware.CrawleraProxyMiddleware': 200
}
For every URL that is scraped successfully I can see output like response.status = 200, but URLs that return 503 are not even passing through process_response. All I can see in the terminal is:

[scrapy] DEBUG: Retrying <GET http://website.com> (failed 1 times): 503 Service Unavailable

In short: I want to scrape the URLs that return 503 again, by passing them through the process_response method of my custom Retry class.
Answer 0 (score: 1)
I had RETRY_HTTP_CODES = [503] in settings.py, which is why Scrapy was handling the 503 code by itself.
Now I changed it to RETRY_HTTP_CODES = [], and every URL that returns 503 passes through the process_response method of the retrymiddleware.Retry class...
Task accomplished.
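A minimal settings.py sketch of this fix, combined with the middleware registration from the question (priorities as the asker had them):

# settings.py
RETRY_HTTP_CODES = []  # built-in RetryMiddleware now retries nothing,
                       # so 503 responses reach the custom middleware

DOWNLOADER_MIDDLEWARES = {
    'craigslist_tickets.retrymiddleware.Retry': 100,
    'craigslist_tickets.crawlera_proxy_middleware.CrawleraProxyMiddleware': 200,
}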
Answer 1 (score: 0)
According to the documentation, RetryMiddleware handles 500 codes by default, and because of its priority your code cannot reach the response (check the base middleware ordering in DOWNLOADER_MIDDLEWARES_BASE). I suggest changing the priority of the Retry middleware to 650, for example:
DOWNLOADER_MIDDLEWARES = {
    'craigslist_tickets.retrymiddleware.Retry': 650,
    'craigslist_tickets.crawlera_proxy_middleware.CrawleraProxyMiddleware': 200
}
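For context on why 650 works: Scrapy calls process_response() in decreasing order of the priority numbers, and the built-in RetryMiddleware sits at 550 in DOWNLOADER_MIDDLEWARES_BASE, so a priority of 650 lets the custom middleware see the 503 before the built-in retry logic does (at the original 100 it ran after it). A hedged variant, not from this answer: when subclassing, the built-in middleware can also be disabled outright so the two never compete:

DOWNLOADER_MIDDLEWARES = {
    'craigslist_tickets.retrymiddleware.Retry': 650,
    'craigslist_tickets.crawlera_proxy_middleware.CrawleraProxyMiddleware': 200,
    # Assumption, not from the answer: mapping the built-in RetryMiddleware
    # to None removes it, leaving all retry handling to the subclass.
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
}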