I'm trying to scrape a large website that has a rate-limiting system. Is it possible to pause Scrapy for 10 minutes when it hits a 403 page? I know I could set a DOWNLOAD_DELAY, but I noticed that I can scrape faster by setting a small DOWNLOAD_DELAY and then pausing Scrapy for a few minutes when it hits a 403. That way, the rate limit only gets triggered about once an hour.
Answer 0 (score: 4)
You can write your own retry middleware and put it in middlewares.py:
from time import sleep

from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message


class SleepRetryMiddleware(RetryMiddleware):

    def process_response(self, request, response, spider):
        if response.status == 403:
            sleep(120)  # a few minutes; note that this blocks the whole Twisted reactor
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response
        return super(SleepRetryMiddleware, self).process_response(request, response, spider)
And don't forget to enable it in settings.py:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'your_project.middlewares.SleepRetryMiddleware': 100,
}
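If you want the full ten-minute pause from the question rather than a hard-coded two minutes, the delay can be read from the settings instead. A minimal sketch, assuming a custom setting named SLEEP_RETRY_DELAY (that name is an illustration, not a built-in Scrapy setting):

from time import sleep

from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message


class ConfigurableSleepRetryMiddleware(RetryMiddleware):

    def __init__(self, settings):
        super(ConfigurableSleepRetryMiddleware, self).__init__(settings)
        # SLEEP_RETRY_DELAY is a hypothetical custom setting; add e.g.
        # SLEEP_RETRY_DELAY = 600 to settings.py for a 10-minute pause.
        self.sleep_delay = settings.getint('SLEEP_RETRY_DELAY', 600)

    def process_response(self, request, response, spider):
        if response.status == 403:
            spider.logger.info('Got 403, pausing for %d seconds', self.sleep_delay)
            sleep(self.sleep_delay)
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response
        return super(ConfigurableSleepRetryMiddleware, self).process_response(request, response, spider)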
Answer 1 (score: 1)
Scrapy is a Python framework built on Twisted, so never use time.sleep or pause.until inside it: a blocking call stalls the entire event loop, and every in-flight request with it. Instead, try using a Deferred() from Twisted.
from twisted.internet import defer, reactor
from scrapy import Request, Spider


def pause(result, seconds):
    # Not defined in the original answer: return a Deferred that fires
    # `seconds` later, passing `result` through, without blocking the reactor.
    d = defer.Deferred()
    reactor.callLater(seconds, d.callback, result)
    return d


class ScrapySpider(Spider):
    name = 'live_function'

    def start_requests(self):
        yield Request('some url', callback=self.non_stop_function)

    def non_stop_function(self, response):
        for url in ['url1', 'url2', 'url3', 'more urls']:
            yield Request(url, callback=self.parse_and_pause)  # changed
        yield Request('some url', callback=self.non_stop_function)  # call itself

    def parse_and_pause(self, response):  # changed
        # Returning a Deferred makes Scrapy wait for it without blocking the reactor.
        d = defer.maybeDeferred(self.second_parse_function, response)
        d.addCallback(pause, seconds=10)  # changed
        return d

    def second_parse_function(self, response):
        pass
More details here: Scrapy: non-blocking pause
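On newer Scrapy versions the same effect is available with coroutine syntax: an async def callback can await asyncio.sleep without blocking the rest of the crawl. A minimal sketch, assuming Scrapy 2.x with the asyncio reactor enabled; the spider name and URLs are placeholders:

import asyncio

from scrapy import Request, Spider


class PausingSpider(Spider):
    name = 'pausing'
    handle_httpstatus_list = [403]  # let 403 responses reach the callback
    # Assumes settings.py contains:
    # TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'

    def start_requests(self):
        yield Request('some url', callback=self.parse)

    async def parse(self, response):
        if response.status == 403:
            await asyncio.sleep(600)  # 10 minutes; only this callback waits
            yield response.request.replace(dont_filter=True)  # retry the same URL

Note that this delays only the callback that saw the 403; other already-scheduled requests keep flowing unless you also throttle them.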