I'd like to know what effect raising CloseSpider has. There is no information about it in the docs at http://doc.scrapy.org/en/latest/topics/exceptions.html#closespider. As you know, Scrapy processes several requests concurrently. What happens if this exception is raised before the last request has been processed? Will Scrapy wait for the rest of the requests that were yielded earlier to be processed? For example:
from scrapy import Request
from scrapy.exceptions import CloseSpider

def parse(self, response):
    base_url = 'http://someurl.com/item/'
    for i in range(1, 100):
        my_url = base_url + str(i)
        if i == 50:
            # stop the spider halfway through generating requests
            raise CloseSpider('')
        else:
            yield Request(url=my_url, callback=self.my_handler)

def my_handler(self, response):
    # handler
    pass
Thanks in advance for your replies.
======================== Possible solution:
is_alive = True

def parse(self, response):
    base_url = 'http://url.com/item/'
    for i in range(1, 100):
        if not self.is_alive:
            break
        my_url = base_url + str(i)
        yield Request(url=my_url, callback=self.my_handler)

def my_handler(self, response):
    # placeholder check: stop scheduling new requests once a page
    # no longer contains any new item
    if not response.css('.item'):
        self.is_alive = False
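A variant of the same idea would be to raise CloseSpider directly from the handler once no new items are found (just a sketch; the '.item' selector is a placeholder), which brings back the original question of what happens to the requests already yielded:

def my_handler(self, response):
    # hypothetical condition: the page contains no new item
    if not response.css('.item'):
        # what happens to the already-yielded requests is the open question
        raise CloseSpider('no_more_items')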
Answer 0 (score: 4):
According to the source code, if a CloseSpider exception is raised, the engine.close_spider() method is executed:
def handle_spider_error(self, _failure, request, response, spider):
    exc = _failure.value
    if isinstance(exc, CloseSpider):
        self.crawler.engine.close_spider(spider, exc.reason or 'cancelled')
        return
engine.close_spider() itself closes the spider and clears all of its outstanding requests:
def close_spider(self, spider, reason='cancelled'):
    """Close (cancel) spider and clear all its outstanding requests"""
    slot = self.slot
    if slot.closing:
        return slot.closing
    logger.info("Closing spider (%(reason)s)",
                {'reason': reason},
                extra={'spider': spider})
    dfd = slot.close()
    # ...
It also schedules close_spider() calls for the different components of Scrapy's architecture: the downloader, the scraper, the scheduler, and so on.
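So the outstanding requests are cancelled rather than waited for. A minimal sketch to observe this (ItemSpider and its URLs are made up for illustration; closed() is Scrapy's shortcut for the spider_closed signal and receives the reason passed to CloseSpider):

from scrapy import Spider, Request
from scrapy.exceptions import CloseSpider

class ItemSpider(Spider):
    # hypothetical spider used only to illustrate the shutdown behaviour
    name = 'item_spider'
    start_urls = ['http://someurl.com/item/1']

    def parse(self, response):
        for i in range(2, 100):
            yield Request('http://someurl.com/item/%d' % i,
                          callback=self.parse_item)

    def parse_item(self, response):
        # raising here triggers engine.close_spider(), which clears the
        # requests still waiting in the scheduler instead of downloading them
        raise CloseSpider('no_more_items')

    def closed(self, reason):
        # called when the spider closes; reason should be 'no_more_items'
        self.logger.info('spider closed: %s', reason)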