Scraping website logos

Time: 2017-12-25 11:23:48

Tags: python python-3.x web-scraping scrapy scrapy-spider

I have a list of websites, and I want to scrape their logos.

Problem:

I have an external class in which I store all the data about a logo (URLs, links). That part works fine:

class PatternUrl:

    def __init__(self, path_to_img="", list_of_conditionals=[]):
        self.url_pattern = ""
        self.file_url = ""
        self.path_to_img = path_to_img
        self.list_of_conditionals = list_of_conditionals

    def find_obj(self, response):
        for el in self.list_of_conditionals:
            if el:
                if self.path_to_img:
                    url = response
                    file_url = str(self.path_to_img)
                    print(file_url)
                    yield LogoScrapeItem(url=url, file_url=file_url)

class LogoSpider(scrapy.Spider):
    ....
    def parse(self, response):
        a = PatternUrl(response.css("header").xpath("//a[@href='"+response.url+'/'+"']/img/@src").extract_first(), [response.css("header").xpath("//a[@href='"+response.url+'/'+"']")] )
        a.find_obj(response)

The problem is with the yield line:

yield LogoScrapeItem(url=url, file_url=file_url)

For some reason, when I comment out this line, all the other lines in this method are executed.

Output when the yield is commented out:

#yield LogoScrapeItem(url=url, file_url=file_url)

2017-12-25 11:09:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://time.com> (referer: None)
........
2017-12-25 11:09:32 [scrapy.core.engine] INFO: Closing spider (finished)
2017-12-25 11:09:32 [scrapy.statscollectors] INFO: Dumping Scrapy stats:

Output when the yield is not commented out:

yield LogoScrapeItem(url=url, file_url=file_url)

2017-12-25 11:19:28 [scrapy.core.engine] INFO: Spider opened
2017-12-25 11:19:28 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-12-25 11:19:28 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2017-12-25 11:19:28 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://git-scm.com/robots.txt> (referer: None)
2017-12-25 11:19:28 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://git-scm.com/docs/git-merge> (referer: None)
2017-12-25 11:19:28 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://time.com/robots.txt> (referer: None)
2017-12-25 11:19:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://time.com> (referer: None)
2017-12-25 11:19:29 [scrapy.core.engine] INFO: Closing spider (finished)
2017-12-25 11:19:29 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 926,

Question:

When the yield statement is present, the body of the function is not executed. Why?

2 answers:

Answer 0 (score: 1)

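Because of the yield keyword, your find_obj method is actually a generator function. Calling it only creates a generator object; none of its body runs until that object is iterated, which is why even the print call produces no output. For a detailed explanation of generators and yield, I recommend this StackOverflow question.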

To get results from your method, you should call it in a way similar to this:

for logo_scrape_item in a.find_obj(response):
    # perform an action on your logo_scrape_item
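The behaviour can be reproduced without Scrapy. The following is a minimal sketch, not the asker's code: `find_logos` is a hypothetical stand-in for find_obj, a plain dict stands in for LogoScrapeItem, and the `calls` list exists only to record whether the body ran.

```python
import inspect

calls = []  # records whether the function body actually ran


def find_logos(path_to_img):
    # Hypothetical stand-in for find_obj: the presence of `yield`
    # makes this a generator function.
    calls.append("body started")
    yield {"file_url": path_to_img}  # plain dict instead of LogoScrapeItem


gen = find_logos("/logo.png")   # only creates a generator object
started_before = list(calls)    # snapshot: the body has not run yet

items = list(gen)               # iterating is what drives the body
started_after = list(calls)     # snapshot: the body has now run

print(inspect.isgenerator(gen))
print(started_before, started_after, items)
```

Calling `find_logos(...)` on its own, as the question does with `a.find_obj(response)`, leaves the generator unconsumed, so nothing inside it executes.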

Answer 1 (score: 1)

yield is meant for producing generators.

It looks like you should run find_obj as:

for x in a.find_obj(response):

instead.

For details on yield, see What does the "yield" keyword do?
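Inside a spider callback, the items also have to be passed on to whoever iterates `parse`. One way to do that on Python 3.3+ is `yield from`. The sketch below uses plain-Python stand-ins so it runs without Scrapy: `FakeResponse`, the dict used in place of LogoScrapeItem, and the logo path are all assumptions made for illustration, and the class is a simplified version of the one in the question.

```python
class FakeResponse:
    # Hypothetical stand-in for a Scrapy response; only `url` is modelled.
    def __init__(self, url):
        self.url = url


class PatternUrl:
    # Simplified version of the class from the question.
    def __init__(self, path_to_img=""):
        self.path_to_img = path_to_img

    def find_obj(self, response):
        # Generator: yields one item per logo found.
        # A plain dict stands in for LogoScrapeItem here.
        if self.path_to_img:
            yield {"url": response.url, "file_url": str(self.path_to_img)}


def parse(response):
    a = PatternUrl("/static/logo.png")  # assumed path, for illustration only
    # Re-yield every item so that the caller iterating parse() receives it.
    yield from a.find_obj(response)


results = list(parse(FakeResponse("http://time.com")))
print(results)
```

In a real spider, Scrapy itself iterates the value returned by the callback, so a `parse` method written this way hands each yielded item to the engine. The explicit loop `for x in a.find_obj(response): yield x` is equivalent.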