Python Scrapy does not retry timed-out connections

Time: 2013-12-12 01:55:29

Tags: python web-scraping screen-scraping scrapy

I am using some proxies to scrape certain websites. This is what I did in settings.py:

# Retry many times since proxies often fail
RETRY_TIMES = 10
# Retry on most error codes since proxies fail for different reasons
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

DOWNLOAD_DELAY = 3  # 3,000 ms of delay

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    'myspider.comm.rotate_useragent.RotateUserAgentMiddleware': 100,

    'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 200,
    'myspider.comm.random_proxy.RandomProxyMiddleware': 300,

    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 400,
}

I also have a proxy downloader middleware with the following methods:

def process_request(self, request, spider):
    log('Requesting url %s with proxy %s...' % (request.url, proxy))

def process_response(self, request, response, spider):
    log('Response received from request url %s with proxy %s' % (request.url, proxy if proxy else 'nil'))

def process_exception(self, request, exception, spider):
    log('Failed to request url %s with proxy %s with exception %s' % (request.url, proxy if proxy else 'nil', str(exception)))
    # Retry again.
    return request

Since the proxies are sometimes not very stable, process_exception often logs a lot of request-failure messages. The problem here is that the failed requests are never tried again.

As shown above, I have set the RETRY_TIMES and RETRY_HTTP_CODES settings, and I also return the request for retry in the proxy middleware's process_exception method.

Why does Scrapy never retry the failed requests, or how can I make sure a request is attempted at least the RETRY_TIMES I set in settings.py?

2 Answers:

Answer 0 (score: 6):

Thanks to @nyov on the Scrapy IRC channel for the help.

'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 200,
'myspider.comm.random_proxy.RandomProxyMiddleware': 300,

Here the Retry middleware runs first, so it retries the request before it ever gets to the Proxy middleware. In my case, Scrapy needs the proxies to crawl the website, otherwise it just times out endlessly.

So I reversed the priorities of the two downloader middlewares:

'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 300,
'myspider.comm.random_proxy.RandomProxyMiddleware': 200,
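
Put together, a sketch of the full DOWNLOADER_MIDDLEWARES block after the swap (same middleware paths as in the question; the inline comment is my reading of why the order matters):

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    'myspider.comm.rotate_useragent.RotateUserAgentMiddleware': 100,

    # RetryMiddleware now has the higher number, i.e. it sits closer to
    # the downloader, so its process_exception sees download errors first
    # and can schedule proper, counted retries; the retried request then
    # passes through RandomProxyMiddleware again and gets a fresh proxy.
    'myspider.comm.random_proxy.RandomProxyMiddleware': 200,
    'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 300,

    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 400,
}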

Answer 1 (score: 0):

It seems that your proxy downloader middleware's process_response does not play by the rules and therefore breaks the middleware chain.

process_response() should either: return a Response object, return a Request object or raise an IgnoreRequest exception.

If it returns a Response (it could be the same given response, or a brand-new one), that response will continue to be processed with the process_response() of the next middleware in the chain.

...
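
In other words, the process_response in the question logs the response but never returns it, so it implicitly returns None and the chain breaks there. A minimal sketch of a rule-abiding version, reusing the question's log helper and assuming the proxy is stored in request.meta:

def process_response(self, request, response, spider):
    proxy = request.meta.get('proxy')  # assumption: set earlier in process_request
    log('Response received from request url %s with proxy %s' % (request.url, proxy if proxy else 'nil'))
    # Returning the response keeps the middleware chain intact, so the
    # remaining middlewares (and eventually the spider) receive it.
    return response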