Scrapy spider closed

Time: 2019-02-12 18:43:31

Tags: python-3.x web-scraping scrapy

I need to run a script after my spider finishes. I see that Scrapy has a handler called spider_closed(), but what I don't understand is how to incorporate it into my script. What I want to do is, once the scraper has finished crawling, combine all of my CSV files and load them into a sheet. If anyone can help with this, that would be great.

2 answers:

Answer 0 (score: 1)

Following the example in the documentation, you add the following to your Spider:

from scrapy import signals  # needed at the top of your spider module

# This function remains as-is.
@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    spider = super().from_crawler(crawler, *args, **kwargs)
    crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
    return spider

# This is where you do your CSV combination.
def spider_closed(self, spider):
    # Whatever is here will run when the spider is done.
    combine_csv_to_sheet()
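
combine_csv_to_sheet() is left undefined in the answer; here is a minimal sketch of what it could look like, assuming pandas is installed (plus openpyxl for the .xlsx output) and that the spiders write their CSVs into a single known directory. The directory and file names are illustrative, not part of the original answer.

import glob

import pandas as pd


def combine_csv_to_sheet(csv_dir='output', out_path='combined.xlsx'):
    # Read every CSV in the (assumed) output directory.
    frames = [pd.read_csv(path) for path in glob.glob(f'{csv_dir}/*.csv')]
    if not frames:
        return  # nothing was scraped, nothing to combine
    # Concatenate all rows and write them to a single worksheet.
    combined = pd.concat(frames, ignore_index=True)
    combined.to_excel(out_path, index=False)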

Answer 1 (score: 1)

As per my comment on the other answer about a signal-based solution, here is a way to run some code after multiple spiders are done. It does not involve using the spider_closed signal.

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


process = CrawlerProcess(get_project_settings())
process.crawl('spider1')
process.crawl('spider2')
process.crawl('spider3')
process.crawl('spider4')
process.start()

# CSV combination code goes here. It will only run when all the spiders are done.
# ...
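
For the placeholder at the end, one option (assuming a combine_csv_to_sheet() helper like the sketch in the other answer) is simply:

# process.start() blocks, so this runs only once all four spiders have finished.
combine_csv_to_sheet()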