I have a Scrapy spider that looks like this. Basically it takes a list of URLs, follows the internal links, and grabs the external links. What I'm trying to do is make it somewhat synchronous, so that the url_list is parsed in order.
from scrapy import Spider, Request
from scrapy.linkextractors import LinkExtractor

# CrawlsItem and get_domain are defined elsewhere in the project

class SomeSpider(Spider):
    name = 'grablinksync'
    url_list = ['http://www.sports.yahoo.com/', 'http://www.yellowpages.com/']
    allowed_domains = ['www.sports.yahoo.com', 'www.yellowpages.com']
    links_to_crawl = []
    parsed_links = 0

    def start_requests(self):
        # Initial request starts here
        start_url = self.url_list.pop(0)
        return [Request(start_url, callback=self.get_links_to_parse)]

    def get_links_to_parse(self, response):
        for link in LinkExtractor(allow=self.allowed_domains).extract_links(response):
            self.links_to_crawl.append(link.url)
            yield Request(link.url, callback=self.parse_obj, dont_filter=True)

    def start_next_request(self):
        self.parsed_links = 0
        self.links_to_crawl = []
        # All links have been parsed, now generate request for next URL
        if len(self.url_list) > 0:
            yield Request(self.url_list.pop(0), callback=self.get_links_to_parse)

    def parse_obj(self, response):
        self.parsed_links += 1
        for link in LinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
            item = CrawlsItem()
            item['DomainName'] = get_domain(response.url)
            item['LinkToOtherDomain'] = link.url
            item['LinkFoundOn'] = response.url
            yield item
        if self.parsed_links == len(self.links_to_crawl):
            # This doesn't work
            self.start_next_request()
My problem is that the function start_next_request() never gets called. If I move the code from inside start_next_request() into parse_obj(), then it works as expected:
def parse_obj(self, response):
    self.parsed_links += 1
    for link in LinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
        item = CrawlsItem()
        item['DomainName'] = get_domain(response.url)
        item['LinkToOtherDomain'] = link.url
        item['LinkFoundOn'] = response.url
        yield item
    if self.parsed_links == len(self.links_to_crawl):
        # This works..
        self.parsed_links = 0
        self.links_to_crawl = []
        # All links have been parsed, now generate request for next URL
        if len(self.url_list) > 0:
            yield Request(self.url_list.pop(0), callback=self.get_links_to_parse)
I want to keep that logic abstracted away in start_next_request(), because I plan on calling it from several other places. I know it has something to do with start_next_request() being a generator function, but I'm new to generators and yield, so I'm having a hard time figuring out what I did wrong.
Answer 0 (score: 0)
That's because yield turns the function into a generator, and simply writing self.start_next_request() doesn't make the generator do anything.
Generators are lazy: until you ask them for their first item, they don't do anything at all.
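As a standalone sketch of that laziness (not part of the spider; make_requests is just a toy generator for illustration):

    def make_requests():
        print("generating")       # runs only once the generator is iterated
        yield "some request"

    make_requests()                # creates a generator object and discards it; prints nothing
    for r in make_requests():      # iteration drives the generator, so "generating" is printed
        print(r)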
You can change the code to:
def parse_obj(self, response):
    self.parsed_links += 1
    for link in LinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
        item = CrawlsItem()
        item['DomainName'] = get_domain(response.url)
        item['LinkToOtherDomain'] = link.url
        item['LinkFoundOn'] = response.url
        yield item
    if self.parsed_links == len(self.links_to_crawl):
        for res in self.start_next_request():
            yield res
Since start_next_request() returns a generator, even return self.start_next_request() would work, but only from a callback that isn't itself a generator; inside parse_obj, which already uses yield, you have to iterate and re-yield as shown above.
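As a side note, on Python 3.3+ the re-yield loop can be written more compactly with yield from, which delegates to the inner generator and yields everything it produces:

    if self.parsed_links == len(self.links_to_crawl):
        yield from self.start_next_request()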