Scrapy, limit on start_urls

Posted: 2017-04-21 02:33:09

Tags: python scrapy

I would like to know whether there is a limit to the number of start_urls I can assign to my spider. From what I have searched, there seems to be no documentation on any list limit.

At the moment I have set up my spider to read the start_urls list in from a csv file. The number of URLs is around 1,000,000.
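For context, a minimal sketch of how such a list might be loaded (the file name urls.csv and the single-column layout are assumptions, not part of the question):

import csv

def load_start_urls(path="urls.csv"):
    # read one url per row from the first column of the csv file
    with open(path, newline="") as f:
        return [row[0] for row in csv.reader(f) if row]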

1 Answer:

Answer 0 (score: 7)

There is no limit per se, but you will probably want to limit it yourself, otherwise you might end up with memory problems. What can happen is that all 1,000,000 URLs get scheduled into the Scrapy scheduler at once, and since Python objects are a lot heavier than plain strings, you end up running out of memory.
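If you want to gauge that difference yourself, here is a rough sketch (it assumes the third-party pympler package is installed; the exact numbers depend on your Python and Scrapy versions):

from pympler import asizeof
from scrapy import Request

url = "https://example.com/some/long/path?page=1"
print("plain string:  ", asizeof.asizeof(url), "bytes")
print("Request object:", asizeof.asizeof(Request(url)), "bytes")
# multiply the difference by ~1,000,000 queued requests to estimate scheduler memory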

To avoid this, you can batch your start URLs using the spider_idle signal:

import logging

from scrapy import Request, Spider, signals
from scrapy.exceptions import DontCloseSpider


class MySpider(Spider):
    name = "spider"
    batch_size = 10000

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = cls(crawler, *args, **kwargs)
        # call idle_consume whenever the spider runs out of scheduled requests
        crawler.signals.connect(spider.idle_consume, signals.spider_idle)
        return spider

    def __init__(self, crawler, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.crawler = crawler
        self.urls = []  # read from your csv file here

    def start_requests(self):
        # yield at most one batch; the rest stays buffered in self.urls
        for _ in range(min(self.batch_size, len(self.urls))):
            yield Request(self.urls.pop(0))

    def parse(self, response):
        # parse the response here
        pass

    def idle_consume(self):
        """
        Every time the spider is about to close, check our urls
        buffer for anything left to crawl and schedule the next batch.
        """
        if not self.urls:
            return
        logging.info('Consuming batch')
        for req in self.start_requests():
            # on recent Scrapy versions you may need self.crawler.engine.crawl(req) instead
            self.crawler.engine.schedule(req, self)
        # keep the spider alive so the newly scheduled batch gets crawled
        raise DontCloseSpider
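A hedged usage sketch, assuming the class above lives in a module named myspider.py (the module name is an assumption):

from scrapy.crawler import CrawlerProcess

from myspider import MySpider

process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
process.crawl(MySpider)
process.start()  # blocks until the url buffer is drained and the spider closes

Each time the scheduler drains, spider_idle fires and idle_consume schedules the next batch while raising DontCloseSpider to keep the spider open; once self.urls is empty, the handler returns without raising and the spider closes normally.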