Continuous and concurrent scraping

Date: 2020-01-02 15:45:00

Tags: concurrency scrapy web-crawler twisted

I have two spiders with significantly different crawl times. The goal is to run them continuously (re-crawling each one as soon as it finishes) and concurrently (both at the same time).

One obvious solution is to run them in separate Python processes, but for the sake of the challenge I decided to keep them in a single script.

The examples on the Scrapy website do not cover restarting a spider.

What would you suggest for more Pythonic code? (I don't actually care about the deferred's return value, since everything is saved to a database in the pipeline.)

Also, do these lines actually check whether a given spider is running?

active_crawlers = [crawler.spidercls for crawler in runner.crawlers]
spider_active = spider in active_crawlers
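For reference: in Scrapy, `CrawlerRunner.crawlers` holds the `Crawler` objects currently managed by the runner (a crawler is removed once its crawl finishes), and each crawler's `spidercls` attribute is the spider *class* it was started with. So the membership test works as long as `spider` is the class itself, not an instance. A minimal stub illustrating just the comparison (`FakeCrawler`, `FakeRunner`, and the spider classes are made up for illustration):

```python
class FakeCrawler:
    """Stand-in for scrapy.crawler.Crawler; only spidercls matters here."""
    def __init__(self, spidercls):
        self.spidercls = spidercls

class FastSpider:  # stand-in for a scrapy.Spider subclass
    name = "fast"

class SlowSpider:
    name = "slow"

class FakeRunner:
    """Stand-in for CrawlerRunner: .crawlers holds the active crawlers."""
    def __init__(self, crawlers):
        self.crawlers = crawlers

# Pretend only FastSpider is currently being crawled.
runner = FakeRunner({FakeCrawler(FastSpider)})

# The check from the question: collect the spider *classes* behind the
# active crawlers, then test class membership.
active_crawlers = [crawler.spidercls for crawler in runner.crawlers]
print(FastSpider in active_crawlers)  # True  -> FastSpider is "running"
print(SlowSpider in active_crawlers)  # False -> SlowSpider is not
```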

Here is my code:

import time
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import reactor

import Spider1 # fast one
import Spider2 # slow one


def run_spider(spider, runner):
    """Poll every 30 s and (re)start the spider if it is not currently running."""
    while True:
        # runner.crawlers holds the currently active Crawler objects;
        # spidercls is the spider class each crawler was started with.
        active_crawlers = [crawler.spidercls for crawler in runner.crawlers]
        spider_active = spider in active_crawlers

        if not spider_active:
            print(f'Starting spider {spider.name}!')
            # runner.crawl must run in the reactor thread; this function
            # runs in a worker thread, so hand the call back to the reactor.
            reactor.callFromThread(runner.crawl, spider)
            print(f'Started spider {spider.name}!')
        else:
            print(f'Spider {spider.name} is already running!')
        time.sleep(30)  # blocks this worker thread, not the reactor


def main():
    configure_logging()
    runner = CrawlerRunner()
    reactor.callInThread(run_spider, Spider1, runner)
    reactor.callInThread(run_spider, Spider2, runner)

    d = runner.join()
    d.addBoth(lambda _: reactor.stop())
    reactor.run()


if __name__ == "__main__":
    main()
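For comparison, here is a thread-free sketch of the restart logic (not presented as the one right way): instead of polling from worker threads, chain a callback onto the Deferred that `runner.crawl()` returns, so each spider is restarted the moment its crawl finishes, and both loops share one reactor and therefore run concurrently. The name `loop_crawl` is my own; `Spider1`/`Spider2` are the spider classes from the question.

```python
def loop_crawl(runner, spider_cls):
    """Start one crawl of spider_cls and reschedule it as soon as it finishes."""
    d = runner.crawl(spider_cls)  # Deferred that fires when the crawl ends
    d.addCallback(lambda _: loop_crawl(runner, spider_cls))  # restart immediately
    d.addErrback(lambda failure: failure.printTraceback())   # don't swallow errors
    return d

# Wiring sketch (assuming Spider1 and Spider2 are scrapy.Spider subclasses):
#
#   from scrapy.crawler import CrawlerRunner
#   from scrapy.utils.log import configure_logging
#   from twisted.internet import reactor
#
#   configure_logging()
#   runner = CrawlerRunner()
#   loop_crawl(runner, Spider1)  # both loops are driven by the same reactor,
#   loop_crawl(runner, Spider2)  # so the spiders run concurrently
#   reactor.run()                # runs forever; stop with Ctrl-C
```

Since the loops never end, `runner.join()` and `reactor.stop()` are no longer needed; the process is stopped externally.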

Many thanks for any help and comments.

Edit: code highlighting

0 Answers