Scrapy: Is it possible to pause Scrapy and resume it x minutes later?

Date: 2014-01-16 19:37:33

Tags: python scrapy

I'm trying to crawl a large website that has a rate-limiting system. Is it possible to pause Scrapy for 10 minutes when it hits a 403 page? I know I can set a DOWNLOAD_DELAY, but I've noticed that I can crawl faster by keeping DOWNLOAD_DELAY small and then pausing Scrapy for a few minutes whenever it gets a 403. That way the rate limit only kicks in about once an hour.
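For reference, the throttle mentioned above is just a project setting; a minimal, purely illustrative settings.py line (the value is not from the question) would be:

DOWNLOAD_DELAY = 0.25  # small delay so the crawl stays fast between rate-limit hits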

2 Answers:

Answer 0 (score: 4)

You can write your own retry middleware and put it in your project's middlewares.py:

from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message
from time import sleep

class SleepRetryMiddleware(RetryMiddleware):
    def __init__(self, settings):
        RetryMiddleware.__init__(self, settings)

    def process_response(self, request, response, spider):
        if response.status in [403]:
            sleep(120)  # pause a couple of minutes; note that time.sleep blocks the whole crawler while it waits
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response

        return super(SleepRetryMiddleware, self).process_response(request, response, spider)

And don't forget to enable it in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'your_project.middlewares.SleepRetryMiddleware': 100,
}
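
One caveat, which the second answer also points out: time.sleep blocks the whole Twisted reactor, so every in-flight request stalls for those two minutes, not just the throttled one. As a non-blocking alternative, here is a sketch only, assuming a recent Scrapy 2.x release whose downloader-middleware chain accepts a Deferred return value; the class name DeferredSleepRetryMiddleware and the 600-second delay are illustrative. It returns a Deferred built with twisted.internet.task.deferLater instead of sleeping:

from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message
from twisted.internet import reactor
from twisted.internet.task import deferLater

class DeferredSleepRetryMiddleware(RetryMiddleware):
    def process_response(self, request, response, spider):
        if response.status == 403:
            reason = response_status_message(response.status)
            # Wait 10 minutes without blocking the reactor, then retry the request
            # (or give the original response back if retries are exhausted).
            return deferLater(reactor, 600,
                              lambda: self._retry(request, reason, spider) or response)
        return super().process_response(request, response, spider)

It would be registered in DOWNLOADER_MIDDLEWARES exactly like the blocking version above.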

Answer 1 (score: 1)

Scrapy is a Python framework built on Twisted, so never call time.sleep or pause.until inside it! Instead, try using a Twisted Deferred():

from scrapy import Spider, Request           # imports added for completeness; not shown in the original snippet
from twisted.internet.defer import Deferred  # imports added for completeness; not shown in the original snippet

class ScrapySpider(Spider):
    name = 'live_function'

    def start_requests(self):
        yield Request('some url', callback=self.non_stop_function)

    def non_stop_function(self, response):

        parse_and_pause = Deferred()  # changed
        parse_and_pause.addCallback(self.second_parse_function) # changed
        parse_and_pause.addCallback(pause, seconds=10)  # changed ('pause' is not defined here; it must be a helper that returns a Deferred)

        for url in ['url1', 'url2', 'url3', 'more urls']:
            yield Request(url, callback=parse_and_pause)  # changed

        yield Request('some url', callback=self.non_stop_function)  # Call itself

    def second_parse_function(self, response):
        pass

More information here: Scrapy: non-blocking pause
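
The snippet above is schematic: pause is never defined, and a Scrapy Request expects a callable as its callback rather than a Deferred. Below is a minimal runnable sketch of the same non-blocking idea, under my own assumptions (the spider name, the example.com URLs, and the after_pause/parse_item helpers are illustrative, and it relies on a recent Scrapy accepting a Deferred returned from a callback): the callback returns a Deferred that reactor.callLater fires after the delay, so other requests keep flowing while this one waits.

from scrapy import Spider, Request
from twisted.internet import defer, reactor

class PausingSpider(Spider):
    name = 'pausing_spider'
    start_urls = ['http://example.com/']

    def parse(self, response):
        # Process the page now, but only emit the follow-up requests after a
        # non-blocking 10-second pause: hand Scrapy a Deferred and let the
        # reactor fire it later instead of sleeping.
        d = defer.Deferred()
        reactor.callLater(10, d.callback, response)  # fire the Deferred in 10 seconds
        d.addCallback(self.after_pause)
        return d

    def after_pause(self, response):
        # Runs once the pause is over; Scrapy schedules whatever is returned here.
        return [Request(response.urljoin(href), callback=self.parse_item)
                for href in response.css('a::attr(href)').getall()]

    def parse_item(self, response):
        self.logger.info('Fetched %s after the pause', response.url)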