Problem statement:
After parsing, I send each URL to parse_links to extract email addresses from it.
I want to stop iterating as soon as one of the links yields an email address and the result is returned.
For example:
Suppose the loop has two URLs: example.com/contact and example.com/about.
If an email address is found on example.com/contact, I don't want to scrape the second one. But currently I receive email addresses from all the links.
Here is my code:
def parse(self, response):
    urls = [
        instance.url for instance in LinkExtractor(
            allow_domains='example.com'
        ).extract_links(response)
    ]
    for url in sorted(urls, reverse=True):
        request = Request(url, callback=self.parse_links)
        yield request

def parse_links(self, response):
    item = EmailScraperItem()
    mailrex = '[\w\.-]+@[\w\.-]+'
    result = response.xpath('//a[@href]').re('%s' % mailrex)
    if result:
        item['emails'] = result  # here, how can I send the first value and ignore the other results?
        return item
After running the scraper, I get this output:
2017-01-30 20:31:27 [scrapy.core.scraper] DEBUG: Scraped from <200 http://example.com/contact/>
{'emails': ['abc@example.com']} # first result
2017-01-30 20:31:29 [scrapy.core.scraper] DEBUG: Scraped from <200 http://example.com/about/>
{'emails': ['xyz@example.com']} # second result
I only want the first one.
Answer 0 (score: 1)
Because of Scrapy's asynchronous nature, you can't be sure that responses will reach the callback in the same order the requests were made. What you can do is take the list of URLs, pass the remainder along via meta, and visit the URLs one at a time, in order:
def parse(self, response):
    urls = [
        instance.url for instance in LinkExtractor(
            allow_domains='example.com'
        ).extract_links(response)
    ]
    try:
        # take one url and pass the remaining ones to the callback
        return Request(urls.pop(), callback=self.parse_links, meta={'urls': urls})
    except IndexError:
        pass

def parse_links(self, response):
    item = EmailScraperItem()
    mailrex = '[\w\.-]+@[\w\.-]+'
    result = response.xpath('//a[@href]').re('%s' % mailrex)
    if result:
        item['emails'] = result
        return item
    # if no emails were found, request the next url from the list
    try:
        urls = response.meta['urls']
        return Request(urls.pop(), callback=self.parse_links, meta={'urls': urls})
    except IndexError:
        pass
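The chaining logic above can be modeled outside Scrapy to see why it stops early. A minimal plain-Python sketch, where the `pages` dict and the `fetch` callable are hypothetical stand-ins for Scrapy's request/response cycle:

```python
import re

MAIL_RE = r'[\w.-]+@[\w.-]+'

def crawl_until_email(urls, fetch):
    """Visit urls one at a time (popping from the end of the list,
    mirroring the answer's urls.pop()) and stop at the first page
    that yields an email address."""
    pending = list(urls)
    visited = []
    while pending:
        url = pending.pop()
        visited.append(url)
        found = re.findall(MAIL_RE, fetch(url))
        if found:
            return found, visited  # stop: remaining urls are never fetched
    return [], visited

# hypothetical page contents standing in for real responses
pages = {
    'example.com/about': '<a href="mailto:xyz@example.com">about</a>',
    'example.com/contact': '<a href="mailto:abc@example.com">contact</a>',
}
emails, visited = crawl_until_email(sorted(pages), pages.get)
```

Here `emails` holds only the addresses from the first productive page, and `example.com/about` is never fetched. To additionally keep a single address per page (the inline question in the original code), slice the match list, e.g. `item['emails'] = result[:1]`.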