I have two scrapers whose scraping times differ significantly. The goal is to run them continuously (re-scraping as soon as each one finishes) and concurrently.
One obvious solution would be to run them in separate Python processes, but for the sake of the challenge I decided to keep them in a single script.
The example provided on the Scrapy website does not cover restarting a spider.
Do you have any suggestions for more Pythonic code? (I don't actually care about the deferred return values, since everything is saved to a database in the pipelines.)
Also, does this check actually verify whether the given spider is running?
    active_crawlers = [crawler.spidercls for crawler in runner.crawlers]
    spider_active = spider in active_crawlers
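For what it's worth, crawl() returns a Deferred that fires when the crawl finishes, so the same "is it still running?" state could presumably also be tracked explicitly instead of inspecting runner.crawlers. A minimal sketch only, where running and start_if_idle are illustrative names and not part of my code below:

    # Sketch: track completion via the Deferred returned by crawl()
    running = {}

    def start_if_idle(runner, spider):
        if running.get(spider):
            return  # a crawl for this spider is still in progress
        running[spider] = True
        d = runner.crawl(spider)  # Deferred fires once the crawl ends
        d.addBoth(lambda _: running.pop(spider, None))  # mark it idle again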
Here is my code:
    import time

    import scrapy
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    from twisted.internet import reactor

    import Spider1  # fast one
    import Spider2  # slow one


    def run_spider(spider, runner):
        while True:
            # Spider classes of the crawls this runner is currently managing
            active_crawlers = [crawler.spidercls for crawler in runner.crawlers]
            spider_active = spider in active_crawlers
            if not spider_active:
                print(f'Starting spider {spider.name}!')
                runner.crawl(spider)
                print(f'Started spider {spider.name}!')
            else:
                print(f'Spider {spider.name} is already running!')
            # check again in 30 seconds
            time.sleep(30)


    def main():
        configure_logging()
        runner = CrawlerRunner()
        # the poll-and-restart loops run in the reactor's thread pool
        reactor.callInThread(run_spider, Spider1, runner)
        reactor.callInThread(run_spider, Spider2, runner)
        d = runner.join()
        d.addBoth(lambda _: reactor.stop())
        reactor.run()


    if __name__ == "__main__":
        main()
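For reference, a restart loop could presumably also be driven entirely by the reactor, chaining each new crawl off the Deferred that crawl() returns, so no polling thread or sleep is needed. A minimal sketch only (loop_crawl is an illustrative name; the Spider1/Spider2 imports mirror the code above):

    from twisted.internet import reactor
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging

    import Spider1  # fast one
    import Spider2  # slow one


    def loop_crawl(runner, spider):
        # crawl() returns a Deferred that fires when this crawl finishes;
        # schedule the same spider again as soon as that happens.
        deferred = runner.crawl(spider)
        deferred.addBoth(lambda _: loop_crawl(runner, spider))
        return deferred


    def main():
        configure_logging()
        runner = CrawlerRunner()
        loop_crawl(runner, Spider1)
        loop_crawl(runner, Spider2)
        reactor.run()  # runs until interrupted (e.g. Ctrl+C)


    if __name__ == "__main__":
        main()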
Any help and comments are greatly appreciated.
EDIT: code highlighting