How does Scrapy process the URLs given in the urls variable under start_requests?

Date: 2019-02-24 03:03:44

Tags: python scrapy

Just wondering why, when I have urls = ['site1', 'site2'] and I run Scrapy from a script by calling .crawl() twice in a row, like

def run_spiders():
    process.crawl(Spider)
    process.crawl(Spider)

the output is:

site1info
site1info
site2info
site2info 

as opposed to

site1info
site2info
site1info
site2info
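
For context, here is a minimal sketch of the kind of script being described (the spider body, its start_urls, and the logging in parse are assumptions, since the post only shows the run_spiders fragment):

import scrapy
from scrapy.crawler import CrawlerProcess

class Spider(scrapy.Spider):
    # Hypothetical spider; the real one is not shown in the question.
    name = "example"
    start_urls = ["http://site1", "http://site2"]

    def parse(self, response):
        self.logger.info("%sinfo", response.url)

process = CrawlerProcess()

def run_spiders():
    # Both crawls are only scheduled here; nothing runs until
    # process.start() spins up the reactor.
    process.crawl(Spider)
    process.crawl(Spider)

run_spiders()
process.start()  # blocks until both crawls finish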

2 answers:

Answer 0 (score: 0)

start_requests uses the yield keyword. yield queues up the requests with the scheduler. To understand it fully, read this StackOverflow answer.

Here is a code example of how start_urls is used in the start_requests method (wrapped in a minimal spider class here so it runs as-is; the class name and scheme-qualified URLs are illustrative).

import scrapy

class MySpider(scrapy.Spider):
    # Hypothetical class name, added so the snippet runs as-is.
    name = "example"
    start_urls = [
        "http://url1.com",
        "http://url2.com",
    ]

    def start_requests(self):
        # Yielding queues each request with the scheduler instead of
        # sending it immediately.
        for u in self.start_urls:
            yield scrapy.Request(u, callback=self.parse)

For custom request ordering, you can use this priority feature.

def start_requests(self):
    yield scrapy.Request(self.start_urls[0], callback=self.parse)
    yield scrapy.Request(self.start_urls[1], callback=self.parse, priority=1)

The request with the higher priority is pulled from the queue first. By default, the priority is 0.
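
Building on that, here is a sketch (an illustration, not part of the original answer) that biases the scheduler toward crawling start_urls in list order by giving earlier URLs higher priorities. Note that priority only influences scheduling; with concurrent downloads, responses can still arrive out of order:

def start_requests(self):
    # Earlier URLs get higher priority values, so the scheduler
    # tends to dequeue them first (the default priority is 0).
    for i, u in enumerate(self.start_urls):
        yield scrapy.Request(
            u,
            callback=self.parse,
            priority=len(self.start_urls) - i,
        )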

Answer 1 (score: 0)

Because once you call process.start(), requests are processed asynchronously. The order is not guaranteed.

In fact, even if you only call process.crawl() once, you may sometimes get:

site2info
site1info

To run spiders sequentially from Python, see this other answer.
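
For reference, a sketch of that sequential pattern, following the approach in the Scrapy documentation of chaining crawls with Twisted deferreds (the Spider class here is the same placeholder as above):

from twisted.internet import defer, reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    # Each yield waits for the previous crawl to finish, so the
    # second run only starts after the first has completed.
    yield runner.crawl(Spider)
    yield runner.crawl(Spider)
    reactor.stop()

crawl()
reactor.run()  # blocks until crawl() stops the reactor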
