How to retry a 503 response using Scrapy DownloaderMiddleware?

Time: 2016-11-11 18:21:51

Tags: python web-scraping scrapy

I have the following:

import logging

from scrapy.downloadermiddlewares.retry import RetryMiddleware

logger = logging.getLogger(__name__)


class Retry(RetryMiddleware):

    def process_response(self, request, response, spider):
        # response.status is an int, so compare with 503, not the string '503'
        if response.status == 503:
            logger.error("503 status returned: " + response.url)
            return self._retry(request, response, spider) or response

        logger.debug("response.status = " + str(response.status)
                     + " from URL " + str(response.url))
        logger.debug(response.headers)

        return super(Retry, self).process_response(request, response, spider)

    def _retry(self, request, response, spider):
        # delete_session is defined elsewhere in this project; it tears down
        # the Crawlera session whose id travels in request.meta
        logger.debug("Deleting session " + str(request.meta['sessionId']))
        self.delete_session(request.meta['sessionId'])

        logger.debug("Retrying URL: %(request)s", {'request': request})
        logger.debug("Response headers were:")
        logger.debug(request.headers)

        # re-issue the request with a fresh Crawlera session;
        # crawlera_auth is a project-level constant
        retryreq = request.copy()
        retryreq.headers['Authorization'] = crawlera_auth.strip()
        retryreq.headers['X-Crawlera-Session'] = 'create'
        retryreq.dont_filter = True
        return retryreq

In settings.py I have this:

DOWNLOADER_MIDDLEWARES = {
    'craigslist_tickets.retrymiddleware.Retry': 100,
    'craigslist_tickets.crawlera_proxy_middleware.CrawleraProxyMiddleware': 200
}

For every URL that is scraped successfully I can see output like response.status = 200, but the URLs that return 503 never even pass through process_response.

All I can see in the terminal is

[scrapy] DEBUG: Retrying <GET http:website.com> (failed 1 times): 503 Service Unavailable

Short question:

I want URLs that return 503 to be scraped again, by passing them through the process_response method of my custom Retry class.

2 answers:

Answer 0 (score: 1)

I had

RETRY_HTTP_CODES = [503]

in settings.py, and that is why Scrapy was handling the 503 code by itself.

Now I have changed it to RETRY_HTTP_CODES = [], and every URL that returns 503 now passes through the process_response method of the retrymiddleware.Retry class...
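
A minimal sketch of the resulting settings.py, assuming the middleware paths from the question (emptying RETRY_HTTP_CODES keeps the built-in RetryMiddleware from intercepting 503 responses before the custom middleware sees them):

RETRY_HTTP_CODES = []  # do not let the built-in RetryMiddleware swallow 503s

DOWNLOADER_MIDDLEWARES = {
    'craigslist_tickets.retrymiddleware.Retry': 100,
    'craigslist_tickets.crawlera_proxy_middleware.CrawleraProxyMiddleware': 200
}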

Task accomplished.

Answer 1 (score: 0)

According to the documentation, the built-in RetryMiddleware handles 503 (among other retryable codes) by default, and because of its priority your code never gets to see the response (have a look at the base DOWNLOADER_MIDDLEWARES_BASE setting). I suggest changing the priority of the Retry middleware to 650, for example:

DOWNLOADER_MIDDLEWARES = {
    'craigslist_tickets.retrymiddleware.Retry': 650,
    'craigslist_tickets.crawlera_proxy_middleware.CrawleraProxyMiddleware': 200
}