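I am trying to run a Scrapy spider from a separate script. When I execute this script in a loop (for example, running the same spider with different parameters), I get ReactorAlreadyRunning. My snippet: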

    from celery import task
    from episode.skywalker.crawlers import settings
    from multiprocessing.queues import Queue
    from scrapy import log, project, signals
    from scrapy.settings import CrawlerSettings
    from scrapy.spider import BaseSpider
    from scrapy.spidermanager import SpiderManager
    from scrapy.xlib.pydispatch import dispatcher
    import multiprocessing
    from twisted.internet.error import ReactorAlreadyRunning


    class CrawlerWorker(multiprocessing.Process):

        def __init__(self, spider, result_queue):
            from scrapy.crawler import CrawlerProcess
            multiprocessing.Process.__init__(self)
            self.result_queue = result_queue
            self.crawler = CrawlerProcess(CrawlerSettings(settings))
            if not hasattr(project, 'crawler'):
                self.crawler.install()
            self.crawler.configure()
            self.items = []
            self.spider = spider
            dispatcher.connect(self._item_passed, signals.item_passed)

        def _item_passed(self, item):
            self.items.append(item)

        def run(self):
            self.crawler.crawl(self.spider)
            try:
                self.crawler.start()
            except ReactorAlreadyRunning:
                pass
            self.crawler.stop()
            self.result_queue.put(self.items)


    @task
    def execute_spider(spider, **spider__kwargs):
        '''
        Execute spider within separate process
        @param spider: spider class to crawl or the name (check if instance)
        '''
        if not isinstance(spider, BaseSpider):
            manager = SpiderManager(settings.SPIDER_MODULES)
            spider = manager.create(spider, **spider__kwargs)
        result_queue = Queue()
        crawler = CrawlerWorker(spider, result_queue)
        crawler.start()
        items = []
        for item in result_queue.get():
            items.append(item)

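My guess is that this is caused by the Twisted reactor being run more than once. How can I avoid it? Is there, in general, a way to run spiders without the reactor?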
Answer 0 (score: 1)
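I figured out what caused the problem: if the execute_spider method is somehow called from inside a CrawlerWorker process (for example, through recursion), that call tries to start a second reactor in a process where one is already running, which is not possible.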
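
To make the failure mode concrete, here is a toy model that does not use Twisted at all. ToyReactor, crawl and the local ReactorAlreadyRunning class are hypothetical stand-ins invented for this illustration; they only mimic the rule that a process has one reactor and it cannot be started while it is already running. Unlike this toy, the real Twisted reactor also cannot be restarted after it has stopped, which is why the question runs each crawl in its own process.

```python
class ReactorAlreadyRunning(Exception):
    """Stand-in for twisted.internet.error.ReactorAlreadyRunning."""


class ToyReactor(object):
    """Hypothetical stand-in for a per-process event loop singleton."""

    def __init__(self):
        self.running = False

    def run(self, job):
        # Refuse to start while a previous run() is still in progress.
        if self.running:
            raise ReactorAlreadyRunning()
        self.running = True
        try:
            return job()
        finally:
            self.running = False


# One instance per process, like Twisted's global reactor.
reactor = ToyReactor()


def crawl(name, nested=None):
    """Run a 'spider'; optionally start another one from inside it."""
    def job():
        if nested is not None:
            # Re-enters reactor.run() while the outer run is still active.
            crawl(nested)
        return name
    return reactor.run(job)
```

Calling crawl("a") succeeds, while crawl("a", nested="b") raises ReactorAlreadyRunning, which is the same shape as a spider callback that ends up calling execute_spider inside the worker process.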
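My solution: move every statement that leads to a recursive call into the execute_spider method itself, so that it triggers the spider execution in the calling process rather than in a secondary CrawlerWorker. I also added a check like this: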

            try:
                self.crawler.start()
            except ReactorAlreadyRunning:
                raise RecursiveSpiderCall("Spider %s was called from another spider recursively. Such behavior is not allowed" % (self.spider))

to catch unintentional recursive spider calls.
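
The answer does not show how RecursiveSpiderCall is defined, so the following is only a sketch of one way it could look. The start_crawler helper and the AlreadyRunningCrawler class are hypothetical names introduced for illustration, and the fallback ReactorAlreadyRunning class exists only so the sketch runs without Twisted installed.

```python
try:
    from twisted.internet.error import ReactorAlreadyRunning
except ImportError:  # lets the sketch run without Twisted installed
    class ReactorAlreadyRunning(Exception):
        """Stand-in for the Twisted exception of the same name."""


class RecursiveSpiderCall(Exception):
    """A spider run was requested from inside another spider's process."""


def start_crawler(crawler, spider):
    """Start `crawler`, reporting a nested start as RecursiveSpiderCall.

    `crawler` is assumed to be any object with a start() method that
    raises ReactorAlreadyRunning when the reactor is already running.
    """
    try:
        crawler.start()
    except ReactorAlreadyRunning:
        raise RecursiveSpiderCall(
            "Spider %s was called from another spider recursively. "
            "Such behavior is not allowed" % (spider,)
        )


class AlreadyRunningCrawler(object):
    """Hypothetical crawler whose reactor is already running."""

    def start(self):
        raise ReactorAlreadyRunning()
```

With these definitions, start_crawler(AlreadyRunningCrawler(), "demo") raises RecursiveSpiderCall with the spider name in the message. Keep in mind that an exception raised inside run() of a multiprocessing.Process is not re-raised in the parent, so a parent blocked on result_queue.get() would still need its own way to notice the failure.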