How to handle a 429 Too Many Requests response in Scrapy?

Date: 2017-04-26 09:39:18

Tags: web-scraping scrapy

I'm trying to run a scraper whose log output ends as follows:

2017-04-25 20:22:22 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <429 http://www.apkmirror.com/apk/instagram/instagram-instagram/instagram-instagram-9-0-0-34920-release/instagram-9-0-0-4-android-apk-download/>: HTTP status code is not handled or not allowed
2017-04-25 20:22:22 [scrapy.core.engine] INFO: Closing spider (finished)
2017-04-25 20:22:22 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 16048410,
 'downloader/request_count': 32902,
 'downloader/request_method_count/GET': 32902,
 'downloader/response_bytes': 117633316,
 'downloader/response_count': 32902,
 'downloader/response_status_count/200': 121,
 'downloader/response_status_count/429': 32781,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 4, 25, 18, 22, 22, 710446),
 'log_count/DEBUG': 32903,
 'log_count/INFO': 32815,
 'request_depth_max': 2,
 'response_received_count': 32902,
 'scheduler/dequeued': 32902,
 'scheduler/dequeued/memory': 32902,
 'scheduler/enqueued': 32902,
 'scheduler/enqueued/memory': 32902,
 'start_time': datetime.datetime(2017, 4, 25, 17, 54, 36, 621481)}
2017-04-25 20:22:22 [scrapy.core.engine] INFO: Spider closed (finished)

In short, of 32,902 requests only 121 succeeded (response code 200), while the rest received '429 Too Many Requests' (see https://httpstatuses.com/429).

Is there a recommended way to deal with this? To begin with, I'd like to look at the details of the 429 response instead of ignoring it, since it may contain a Retry-After header indicating how long to wait before making a new request.

Also, if the requests are made through Privoxy and Tor as described in http://blog.michaelyin.info/2014/02/19/scrapy-socket-proxy/, it should be possible to implement a retry middleware that makes Tor change its IP address whenever this happens. Are there public examples of such code?
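For illustration, a rough sketch of that idea (assuming Tor's ControlPort is enabled on 9051 and the stem library is installed; the class name and port are only placeholders) might look like this:

from stem import Signal
from stem.control import Controller
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message

class TorRetryMiddleware(RetryMiddleware):
    """On a 429, ask Tor for a new circuit (new exit IP) before retrying."""

    def process_response(self, request, response, spider):
        if request.meta.get('dont_retry', False):
            return response
        if response.status == 429:
            self._new_tor_identity()
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response
        return super(TorRetryMiddleware, self).process_response(request, response, spider)

    def _new_tor_identity(self):
        # Requires "ControlPort 9051" (plus password or cookie auth) in torrc.
        with Controller.from_port(port=9051) as controller:
            controller.authenticate()
            controller.signal(Signal.NEWNYM)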

5 answers:

Answer 0 (score: 10):

You can modify the retry middleware so that it pauses when it receives a 429 error. Put this code in middlewares.py:

from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message

import time

class TooManyRequestsRetryMiddleware(RetryMiddleware):

    def __init__(self, crawler):
        super(TooManyRequestsRetryMiddleware, self).__init__(crawler.settings)
        self.crawler = crawler

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_response(self, request, response, spider):
        if request.meta.get('dont_retry', False):
            return response
        elif response.status == 429:
            self.crawler.engine.pause()
            time.sleep(60) # If the rate limit is renewed in a minute, put 60 seconds, and so on.
            self.crawler.engine.unpause()
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response
        elif response.status in self.retry_http_codes:
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response
        return response 

Add 429 to the retry codes in settings.py:

RETRY_HTTP_CODES = [429]

Then activate the middleware in settings.py. Don't forget to deactivate the default retry middleware:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'flat.middlewares.TooManyRequestsRetryMiddleware': 543,
}

Answer 1 (score: 5):

Wow, your scraper is really fast, over 30,000 requests in 30 minutes. That's more than 10 requests per second.

Such a high volume will trigger rate limiting on bigger sites and will bring smaller sites down completely. Don't do that.

This is probably even too fast for Privoxy and Tor, so they may also be candidates for those 429 replies.

Solutions:

  1. Start slow. Reduce the concurrency settings and increase DOWNLOAD_DELAY so that you make at most 1 request per second, then increase these values step by step and see what happens. It may sound paradoxical, but by going slower you are likely to get more items and more 200 responses. (A settings sketch follows this list.)

  2. If you are scraping a big site, try rotating proxies. In my experience the Tor network can be a bit heavy-handed for this, so you might try a proxy service as Umair suggests.
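For point 1, a minimal settings.py sketch of that kind of throttling (the exact numbers are only illustrative starting points):

# settings.py -- start slow, then loosen these step by step
CONCURRENT_REQUESTS = 1
CONCURRENT_REQUESTS_PER_DOMAIN = 1
DOWNLOAD_DELAY = 1  # roughly one request per second at most

# Optionally let Scrapy adapt the delay to the server's responses
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10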

Answer 2 (score: 0):

You can allow the status code. I was getting 429 HTTP codes, but I simply allowed that code and the problem was solved. You can allow whichever HTTP codes show up in your terminal; this may solve your problem.
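The answer doesn't name the setting, but in Scrapy letting a status code through to the spider callbacks (instead of having the HttpError middleware drop it, as in the log above) is usually done with HTTPERROR_ALLOWED_CODES; a minimal sketch:

# settings.py -- pass 429 responses to the callbacks so they can be
# inspected (e.g. for a Retry-After header) instead of being ignored
HTTPERROR_ALLOWED_CODES = [429]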

Answer 3 (score: 0):

Building on Aminah Nuraini's answer, you can use Twisted's Deferreds to avoid breaking asynchrony with a time.sleep() call:

from twisted.internet import reactor, defer
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message

class TooManyRequestsRetryMiddleware(RetryMiddleware):
    DEFAULT_DELAY = 600  # 10 min

    async def process_response(self, request, response, spider):
        if request.meta.get('dont_retry', False):
            return response
        elif response.status == 429:
            try:
                delay = int(response.headers.get('retry-after'))
            except (TypeError, ValueError):
                delay = self.DEFAULT_DELAY

            deferred = defer.Deferred()
            reactor.callLater(delay, deferred.callback, None)
            spider.crawler.engine.pause()
            await deferred
            spider.crawler.engine.unpause()

            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response
        elif response.status in self.retry_http_codes:
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response
        return response

await deferred suspends the execution of process_response until delay seconds have passed, but Scrapy is free to do other things in the meantime.

Note the async in the method definition: async/await is the coroutine syntax introduced in Python 3.5 and supported by Scrapy since 2.0.

In this example I chose to respect the Retry-After header when it's available, but that isn't always necessary.

You still need to modify settings.py as in the original answer.

Answer 4 (score: 0):

Here is a simple trick I found:

import scrapy
import time    ## just add this line

BASE_URL = 'your any url'
class EthSpider(scrapy.Spider):
    name = 'eth'
    start_urls = [
        BASE_URL.format(1)
    ]
    pageNum = 2

    def parse(self, response):
        data = response.json()

        for i in range(len(data['data']['list'])):
            yield data['data']['list'][i]

        next_page = 'next page url'

        time.sleep(0.2)      # and add this line

        if EthSpider.pageNum <= data['data']['page']:
            EthSpider.pageNum += 1
            yield response.follow(next_page, callback=self.parse)