I have several different spiders and want to run them all at once. Based on this and this, I know I can run multiple spiders in the same process. However, I don't know how to design a signal system that stops the reactor once all the spiders are finished.
I have tried:
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
and
crawler.signals.connect(reactor.stop, signal=signals.spider_idle)
In both cases the reactor stops as soon as the first crawler closes. What I want, of course, is for the reactor to stop only after all the spiders have finished.
Can anyone tell me how to do this?
Answer 0 (score: 6)
The trick is to keep a count of the crawlers that are still running, and only stop the reactor when that count drops to zero:
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from scrapy.utils.project import get_project_settings

class ReactorControl:
    """Stops the reactor once every running crawler has closed."""

    def __init__(self):
        self.crawlers_running = 0

    def add_crawler(self):
        self.crawlers_running += 1

    def remove_crawler(self):
        self.crawlers_running -= 1
        if self.crawlers_running == 0:
            reactor.stop()

def setup_crawler(spider_name):
    crawler = Crawler(settings)
    crawler.configure()
    # Decrement the counter (and possibly stop the reactor) when this spider closes
    crawler.signals.connect(reactor_control.remove_crawler, signal=signals.spider_closed)
    spider = crawler.spiders.create(spider_name)
    crawler.crawl(spider)
    reactor_control.add_crawler()
    crawler.start()

reactor_control = ReactorControl()
log.start()
settings = get_project_settings()
crawler = Crawler(settings)

# Launch one crawler per spider registered in the project
for spider_name in crawler.spiders.list():
    setup_crawler(spider_name)

reactor.run()
I am assuming here that Scrapy is not parallel.
I don't know if this is the best way, but it works!
Edit: updated. See @Jean-Robert's comment.
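As a side note: this answer uses an old Scrapy API. Recent Scrapy versions ship a CrawlerProcess helper that does this bookkeeping itself, starting the reactor and stopping it once every crawler has finished, so no manual counter is needed. A minimal sketch of that approach (assuming it is run from inside a Scrapy project, so get_project_settings() can locate the spiders):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
process = CrawlerProcess(settings)

# Schedule every spider registered in the project; crawl() accepts spider names
for spider_name in process.spider_loader.list():
    process.crawl(spider_name)

# start() runs the reactor and stops it once all crawlers have finished
process.start()

Stopping the reactor after the last crawl is the default behaviour of process.start() (stop_after_crawl=True), which replaces the ReactorControl counter above.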