Running multiple Scrapy spiders from a script in a loop

Date: 2018-01-31 07:47:37

Tags: python web-scraping scrapy

I have more than 100 spiders and I want to run 5 of them at a time from a script. To do this, I created a table in my database that tracks each spider's status, i.e. whether it has finished running, is currently running, or is waiting to run.
I know how to run multiple spiders from a script:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
for i in range(10):  #this range is just for demo instead of this i 
                    #find the spiders that are waiting to run from database
    process.crawl(spider1)  #spider name changes based on spider to run
    process.crawl(spider2)
    print('-------------this is the-----{}--iteration'.format(i))
    process.start()

However, this is not allowed, because the following error occurs:

Traceback (most recent call last):
File "test.py", line 24, in <module>
  process.start()
File "/home/g/projects/venv/lib/python3.4/site-packages/scrapy/crawler.py", line 285, in start
  reactor.run(installSignalHandlers=False)  # blocking call
File "/home/g/projects/venv/lib/python3.4/site-packages/twisted/internet/base.py", line 1242, in run
  self.startRunning(installSignalHandlers=installSignalHandlers)
File "/home/g/projects/venv/lib/python3.4/site-packages/twisted/internet/base.py", line 1222, in startRunning
  ReactorBase.startRunning(self)
File "/home/g/projects/venv/lib/python3.4/site-packages/twisted/internet/base.py", line 730, in startRunning
  raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

I have searched for the above error but could not resolve it. Managing the spiders could be done with ScrapyD, but we don't want to use ScrapyD because many spiders are still in the development phase.

Any workaround for the above scenario would be appreciated.

Thanks

3 Answers:

Answer 0 (score: 1)

To run multiple spiders simultaneously, you can use this:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start() # the script will block here until all crawling jobs are finished

The answers to this question may also help you.

For more information:

Running multiple spiders in the same process
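
That docs page also shows a CrawlerRunner-based pattern for running spiders sequentially without ever restarting the reactor; here is a minimal sketch of it, reusing the MySpider1/MySpider2 placeholders from the snippet above:

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    # each yield waits for the previous crawl to finish before the next one starts
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()

crawl()
reactor.run()  # the script will block here until the last crawl call is finished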

Answer 1 (score: 1)

I was able to achieve similar functionality by removing the loop from the script and instead having a scheduler run the script every 3 minutes.

The loop's job is done by keeping track of how many spiders are currently running and checking whether more need to be started. In the end, at most 5 spiders (this number can be changed) run at the same time.
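
A minimal sketch of that approach, assuming a hypothetical spiders table with name and status columns (values 'waiting', 'running', 'finished'); the script itself is run every few minutes by cron or another scheduler, and marking spiders as finished again is left out:

import sqlite3
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

MAX_RUNNING = 5  # at most 5 spiders at a time

db = sqlite3.connect('spiders.db')  # hypothetical database file
running = db.execute(
    "SELECT COUNT(*) FROM spiders WHERE status = 'running'").fetchone()[0]
slots = MAX_RUNNING - running

waiting = [row[0] for row in db.execute(
    "SELECT name FROM spiders WHERE status = 'waiting' LIMIT ?", (max(slots, 0),))]

if waiting:
    process = CrawlerProcess(get_project_settings())
    for name in waiting:
        db.execute("UPDATE spiders SET status = 'running' WHERE name = ?", (name,))
        process.crawl(name)  # crawl() also accepts a spider name from the project
    db.commit()
    process.start()  # called once, outside the loop, so the reactor starts only once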

Answer 2 (score: 0)

You need ScrapyD for this purpose.

You can run as many spiders as you want at the same time, and you can use the listjobs API to continuously check whether a spider is still running.

You can set max_proc=5 in the config file so that at most 5 spiders run at a time.
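
For example, a minimal sketch of driving scrapyd over its HTTP API with requests, assuming it listens on the default port 6800 and the project was deployed as myproject (a hypothetical name); max_proc = 5 goes in scrapyd's config file, not in this script:

import requests

SCRAPYD = 'http://localhost:6800'
PROJECT = 'myproject'  # hypothetical deployed project name

# listjobs.json reports the pending, running and finished jobs of a project
jobs = requests.get(SCRAPYD + '/listjobs.json', params={'project': PROJECT}).json()
running = [job['spider'] for job in jobs.get('running', [])]
print('currently running:', running)

# schedule.json queues a spider; scrapyd itself enforces the max_proc limit
requests.post(SCRAPYD + '/schedule.json',
              data={'project': PROJECT, 'spider': 'spider1'})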

Anyway, talking about your code, it will work if you do this:

process = CrawlerProcess(get_project_settings())
for i in range(10):  #this range is just for demo instead of this i 
                    #find the spiders that are waiting to run from database
    process.crawl(spider1)  #spider name changes based on spider to run
    process.crawl(spider2)
    print('-------------this is the-----{}--iteration'.format(i))
process.start()

You need to move process.start() outside the loop, because the Twisted reactor it starts cannot be restarted once it has stopped, which is exactly what the ReactorNotRestartable traceback is complaining about.