What effect does raising CloseSpider have in Scrapy?

Asked: 2015-07-14 18:35:02

Tags: python web-scraping scrapy scrapy-spider scraper

I would like to know what effect raising CloseSpider has. The documentation at http://doc.scrapy.org/en/latest/topics/exceptions.html#closespider says nothing about it. As you know, Scrapy processes several requests concurrently. What happens if this exception is raised before the last request has been handled? Will Scrapy wait for the remaining requests that were yielded earlier to be processed? For example:

from scrapy import Request
from scrapy.exceptions import CloseSpider

def parse(self, response):
    base_url = 'http://someurl.com/item/'
    for i in range(1, 100):
        my_url = base_url + str(i)  # build each item URL instead of appending repeatedly
        if i == 50:
            raise CloseSpider('')
        else:
            yield Request(url=my_url, callback=self.my_handler)

def my_handler(self, response):
    # handle the downloaded item page
    pass

Thanks for your replies.

======================== Possible solution:

from scrapy import Request

is_alive = True  # meant to be a class attribute on the spider

def parse(self, response):
    base_url = 'http://url.com/item/'
    for i in range(1, 100):
        if not self.is_alive:
            break
        yield Request(url=base_url + str(i), callback=self.my_handler)

def my_handler(self, response):
    if not response.css('.item'):  # hypothetical "no new item" check
        self.is_alive = False
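
Put together as a complete spider, the workaround would look roughly like this (the spider name, seed URL and the `.item` selector are illustrative placeholders, not part of the original code):

import scrapy


class ItemSpider(scrapy.Spider):
    name = 'items'                              # illustrative name
    start_urls = ['http://url.com/item/1']      # illustrative seed URL
    is_alive = True                             # flag flipped once no new items appear

    def parse(self, response):
        base_url = 'http://url.com/item/'
        for i in range(1, 100):
            if not self.is_alive:
                break
            yield scrapy.Request(base_url + str(i), callback=self.my_handler)

    def my_handler(self, response):
        if not response.css('.item'):           # hypothetical "no new item" check
            self.is_alive = False

Note that by the time my_handler flips the flag, the parse generator may already have yielded many (or all) of the 99 requests, so this only stops new requests from being scheduled; it does not cancel the ones already queued.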

1 Answer:

Answer 0 (score: 4):

According to the source code, when the CloseSpider exception is raised, the engine.close_spider() method is executed:

def handle_spider_error(self, _failure, request, response, spider):
    exc = _failure.value
    if isinstance(exc, CloseSpider):
        self.crawler.engine.close_spider(spider, exc.reason or 'cancelled')
        return
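
So raising CloseSpider from any callback hands the spider and the reason over to engine.close_spider(); an empty reason, as in the question's CloseSpider(''), falls back to 'cancelled' because of the exc.reason or 'cancelled' expression. A minimal sketch of a callback that triggers this path (the stop condition is illustrative):

from scrapy.exceptions import CloseSpider

def my_handler(self, response):
    if not response.css('.item'):           # illustrative "nothing new on this page" check
        raise CloseSpider('no_more_items')  # becomes exc.reason in handle_spider_error()
    # ... otherwise extract items / yield further requests here ...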

engine.close_spider() itself closes the spider and clears all of its outstanding requests:

def close_spider(self, spider, reason='cancelled'):
    """Close (cancel) spider and clear all its outstanding requests"""

    slot = self.slot
    if slot.closing:
        return slot.closing
    logger.info("Closing spider (%(reason)s)",
                {'reason': reason},
                extra={'spider': spider})

    dfd = slot.close()

    # ...

It also schedules close_spider() calls for the different components of Scrapy's architecture: the downloader, the scraper, the scheduler, and so on.
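
If you want to confirm what reason the spider was closed with, one option is to connect to the spider_closed signal, which receives the reason string, including the one passed to CloseSpider. A minimal sketch, assuming a standard Spider subclass (the spider name is illustrative):

import scrapy
from scrapy import signals


class MySpider(scrapy.Spider):
    name = 'my_spider'  # illustrative
    # ... start_urls / parse as usual ...

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.on_spider_closed, signal=signals.spider_closed)
        return spider

    def on_spider_closed(self, spider, reason):
        # 'reason' is the string passed to CloseSpider, or e.g. 'finished' otherwise
        self.logger.info('Spider closed with reason: %s', reason)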