Returning the first result encountered in Scrapy

Date: 2017-01-30 14:53:22

Tags: python python-3.x web-scraping scrapy web-crawler

  

Problem statement:

After parsing the start page, I send each extracted URL to parse_links to pull email addresses out of it.

As soon as one of those links yields an email address and the result has been returned, I want to stop iterating.

  

Suppose there are 2 URLs in the loop: example.com/contact and example.com/about

If an email address is found on example.com/contact, I don't want to scrape the second one at all. Right now, though, I get email addresses from every link.

Here is my code:

from scrapy import Request
from scrapy.linkextractors import LinkExtractor
from ..items import EmailScraperItem  # assuming the item is defined in the project's items module

def parse(self, response):
    # collect all same-domain links from the page
    urls = [
        instance.url for instance in LinkExtractor(
            allow_domains='example.com'
        ).extract_links(response)
    ]

    for url in sorted(urls, reverse=True):
        request = Request(url, callback=self.parse_links)
        yield request

def parse_links(self, response):
    item = EmailScraperItem()
    mailrex = r'[\w\.-]+@[\w\.-]+'  # raw string, so \w and \. are not treated as string escapes
    result = response.xpath('//a[@href]').re(mailrex)
    if result:
        item['emails'] = result  # here how can I send the first value and ignore the other results?
    return item

After running the scraper, I get this output:

2017-01-30 20:31:27 [scrapy.core.scraper] DEBUG: Scraped from <200 http://example.com/contact/>
{'emails': ['abc@example.com']}  # first result

2017-01-30 20:31:29 [scrapy.core.scraper] DEBUG: Scraped from <200 http://example.com/about/>
{'emails': ['xyz@example.com']}  # second result

I only want the first one.
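
To be clear, keeping only the first address matched on a single page would just be a slice of the regex results; a minimal sketch of the relevant lines of parse_links, not the full fix:

if result:
    item['emails'] = result[:1]  # keep only the first matched address on this page
return item

What I actually need is to stop requesting the remaining URLs once any page has produced an address.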

1 answer:

Answer 0 (score: 1):

Because of Scrapy's asynchronous nature, you cannot be sure that responses will reach the callback in the same order the requests were made. What you can do is build the list of URLs once, pass it along in meta, and visit the URLs sequentially, one request at a time:

def parse(self, response):
    urls = [
        instance.url for instance in LinkExtractor(
            allow_domains='example.com'
        ).extract_links(response)
    ]

    try:
        # take one url and pass the rest to the callback via meta
        return Request(urls.pop(), callback=self.parse_links, meta={'urls': urls})
    except IndexError:
        # no links were extracted from the page
        pass

def parse_links(self, response):
    item = EmailScraperItem()
    mailrex = r'[\w\.-]+@[\w\.-]+'
    result = response.xpath('//a[@href]').re(mailrex)
    if result:
        item['emails'] = result
        # an email was found: return the item and schedule no further requests
        return item
    # no emails found, so request the next url from the list
    try:
        urls = response.meta['urls']
        return Request(urls.pop(), callback=self.parse_links, meta={'urls': urls})
    except IndexError:
        # the list is exhausted and no email address was found
        pass
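
A different way to cut the crawl short, not from the original answer, is to keep the parallel requests from the question and raise Scrapy's CloseSpider exception after the first item is produced. This is only best-effort: requests already in flight may still be processed before the spider actually closes, so extra items are possible. A minimal sketch, assuming the same EmailScraperItem:

from scrapy.exceptions import CloseSpider

def parse_links(self, response):
    mailrex = r'[\w\.-]+@[\w\.-]+'
    result = response.xpath('//a[@href]').re(mailrex)
    if result:
        item = EmailScraperItem()
        item['emails'] = result[:1]  # keep only the first matched address
        yield item
        # ask Scrapy to shut the spider down; already-scheduled requests
        # may still complete before shutdown
        raise CloseSpider('email address found')

The sequential approach above guarantees exactly one item at the cost of crawling one page at a time; CloseSpider keeps concurrency but stops the crawl only approximately.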