Scrapy retry middleware fails with a non-standard HTTP status code

Time: 2016-04-26 03:33:22

Tags: python parsing scrapy

I am using Scrapy's default RetryMiddleware to re-download failed URLs, and I want it to also handle pages that come back with a 429 status code ("Too Many Requests").
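(For context: the usual way to make the stock RetryMiddleware consider 429 is to add it to RETRY_HTTP_CODES. A minimal settings.py sketch; everything besides 429 simply mirrors typical defaults and is not taken from the question:)

    # settings.py -- illustrative; the exact default code list depends on the Scrapy version
    RETRY_ENABLED = True
    RETRY_TIMES = 3
    RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]  # 429 added on top of the usual codes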

But I get this error:

    Traceback (most recent call last):
      File "/home/vagrant/parse/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 588, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "/home/vagrant/parse/local/lib/python2.7/site-packages/scrapy/core/downloader/middleware.py", line 46, in process_response
        response = method(request=request, response=response, spider=spider)
      File "/home/vagrant/parse/local/lib/python2.7/site-packages/scrapy/downloadermiddlewares/retry.py", line 58, in process_response
        reason = response_status_message(response.status)
      File "/home/vagrant/parse/local/lib/python2.7/site-packages/scrapy/utils/response.py", line 58, in response_status_message
        reason = http.RESPONSES.get(int(status)).decode('utf8', errors='replace')
    AttributeError: 'NoneType' object has no attribute 'decode'

I tried to debug the problem and found that, before actually retrying the download, Scrapy's RetryMiddleware builds a string describing why the previous attempt failed. The response_status_message() helper combines the status code with its status text, for example:

    >>> response_status_message(404)
    '404 Not Found'

To obtain the status text, it looks the code up in Twisted's table with http.RESPONSES.get(int(status)). For a non-standard HTTP status code, and with no default argument passed to get(), that lookup returns None instead of a string.

Scrapy then tries to call decode('utf8', errors='replace') on None, which raises the AttributeError above.
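The failure is easy to reproduce in isolation, and passing a default to get() is enough to avoid it. A minimal sketch (safe_status_message is a hypothetical helper, not Scrapy's code; whether 429 is missing from the table depends on the Twisted release):

    from twisted.web import http

    def safe_status_message(status):
        # A default value means unknown codes never come back as None
        reason = http.RESPONSES.get(int(status), b'Unknown Status')
        if isinstance(reason, bytes):
            reason = reason.decode('utf-8', 'replace')
        return '%s %s' % (status, reason)

    print(http.RESPONSES.get(429))   # None if this Twisted release has no 429 entry
    print(safe_status_message(429))  # '429 Unknown Status' (or the real phrase if present)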

Is there a way to avoid this?

1 Answer:

Answer 0 (score: 3)

This is actually a bug in the Scrapy library. It has already been fixed in this commit and is listed in the 1.1 RC changelog.
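If upgrading is not an option, one workaround is to subclass the stock middleware and build the retry reason without calling response_status_message() at all. A rough sketch, assuming a project called myproject (the class name and reason string are illustrative, not part of Scrapy):

    # middlewares.py
    from scrapy.downloadermiddlewares.retry import RetryMiddleware

    class SafeRetryMiddleware(RetryMiddleware):
        def process_response(self, request, response, spider):
            if request.meta.get('dont_retry', False):
                return response
            if response.status in self.retry_http_codes:
                # Plain reason string, so no lookup into http.RESPONSES happens
                reason = 'HTTP status %d' % response.status
                return self._retry(request, reason, spider) or response
            return response

    # settings.py -- swap it in at the default RetryMiddleware priority
    DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
        'myproject.middlewares.SafeRetryMiddleware': 550,
    }

Disabling the built-in entry and registering the subclass at the same priority (550) keeps the rest of the middleware chain unchanged.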