Question

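I have two scrapers whose scraping times differ significantly. The goal is to run them continuously (restarting each one as soon as it finishes) and concurrently.

One obvious solution is to run them in separate Python processes, but for the sake of the challenge I decided to keep them in a single script.

The examples on the Scrapy website do not cover restarting scrapers.

Do you have any suggestions for more Pythonic code? (I don't actually care about the returned Deferred, since everything is saved to a database in the pipeline.)

Also, does this line really check whether the given spider is running?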

active_crawlers = [crawler.spidercls for crawler in runner.crawlers]
spider_active = spider in active_crawlers

Here is my code:

import time
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import reactor

import Spider1 # fast one
import Spider2 # slow one


def run_spider(spider, runner):
    while True:
        active_crawlers = [crawler.spidercls for crawler in runner.crawlers]
        spider_active = spider in active_crawlers

        if not spider_active:
            print(f'Starting spider {spider.name}!')
            runner.crawl(spider)
            print(f'Started spider {spider.name}!')
        else:
            print(f'Spider {spider.name} is already running!')
        time.sleep(30)


def main():
    configure_logging()
    runner = CrawlerRunner()
    reactor.callInThread(run_spider, Spider1, runner)
    reactor.callInThread(run_spider, Spider2, runner)

    d = runner.join()
    d.addBoth(lambda _: reactor.stop())
    reactor.run()


if __name__ == "__main__":
    main()
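
For comparison, one possible way to get "restart as soon as it finishes" without threads or time.sleep is to chain the restart onto the Deferred that runner.crawl returns. The following is only a minimal sketch and has not been run against a particular Scrapy version. The names crawl_forever, keep_going and run are hypothetical helpers introduced for illustration, and it assumes the spider classes are passed in by the caller and that the default Twisted reactor is acceptable for the project's settings.

```python
from functools import partial
from typing import Any, Callable, Iterable


def crawl_forever(start_crawl: Callable[[], Any],
                  keep_going: Callable[[], bool] = lambda: True) -> Any:
    """Start a crawl and start it again each time it finishes.

    start_crawl must return a Deferred-like object that has addBoth(),
    for example the Deferred returned by CrawlerRunner.crawl().
    keep_going is a hypothetical hook that lets the caller end the loop.
    """
    def _restart(result: Any) -> Any:
        # Called when the crawl ends, whether it succeeded or failed.
        if keep_going():
            crawl_forever(start_crawl, keep_going)
        # Pass the result (or failure) through unchanged.
        return result

    d = start_crawl()
    d.addBoth(_restart)
    return d


def run(spider_classes: Iterable[type]) -> None:
    """Run every given spider class in an endless loop, all concurrently."""
    # Imported here so the helper above can be used without Scrapy installed.
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    from twisted.internet import reactor

    configure_logging()
    runner = CrawlerRunner()
    for spider_cls in spider_classes:
        # partial binds the class now, so each loop restarts its own spider.
        crawl_forever(partial(runner.crawl, spider_cls))
    reactor.run()  # blocks; every crawl is scheduled from the reactor thread
```

With this shape, something like run([Spider1, Spider2]) would replace main(), given that Spider1 and Spider2 are the spider classes themselves. Each spider restarts independently of the other, so the slow one never delays the fast one, and no polling of runner.crawlers is needed.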

Any help and comments are much appreciated.

Edit: code highlighting

Scraping continuously and concurrently

0 answers: