I have more than 100 spiders, and I want to run 5 of them at a time from a script. To do this, I created a table in the database that tracks each spider's status, i.e. whether it has finished running, is running, or is waiting to run.
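For reference, the status table could look something like this minimal sketch (the table and column names here are just an illustration, not my actual schema):

import sqlite3

# minimal illustration of the status table described above;
# the table/column names are assumptions
conn = sqlite3.connect('spiders.db')
conn.execute("""
    CREATE TABLE IF NOT EXISTS spider_status (
        name   TEXT PRIMARY KEY,
        status TEXT CHECK (status IN ('waiting', 'running', 'finished'))
    )
""")
conn.commit()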
I know how to run multiple spiders in a script:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
for i in range(10):  # this range is just for demo; instead of this i
    # find the spiders that are waiting to run from database
    process.crawl(spider1)  # spider name changes based on spider to run
    process.crawl(spider2)
    print('-------------this is the-----{}--iteration'.format(i))
    process.start()
However, this is not allowed, because the following error occurs:
Traceback (most recent call last):
  File "test.py", line 24, in <module>
    process.start()
  File "/home/g/projects/venv/lib/python3.4/site-packages/scrapy/crawler.py", line 285, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/home/g/projects/venv/lib/python3.4/site-packages/twisted/internet/base.py", line 1242, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/home/g/projects/venv/lib/python3.4/site-packages/twisted/internet/base.py", line 1222, in startRunning
    ReactorBase.startRunning(self)
  File "/home/g/projects/venv/lib/python3.4/site-packages/twisted/internet/base.py", line 730, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable
I searched for the above error but couldn't resolve it. Managing the spiders could be done with ScrapyD, but we don't want to use ScrapyD because many spiders are still in the development phase.
Any workaround for the above scenario is appreciated.
Thanks
Answer 0 (Score: 1)
To run multiple spiders simultaneously you can use this:
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()  # the script will block here until all crawling jobs are finished
The answers to this question may also help you.
For more information: this same example appears in the Scrapy docs under "Running multiple spiders in the same process".
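If you instead want them to run sequentially on a single reactor (which also avoids ReactorNotRestartable), a sketch based on the same docs page, reusing MySpider1 and MySpider2 from above:

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    # each yield waits for the previous crawl to finish, so the spiders
    # run one after another on a single reactor that is started only once
    yield runner.crawl(MySpider1)  # MySpider1/MySpider2 as defined above
    yield runner.crawl(MySpider2)
    reactor.stop()

crawl()
reactor.run()  # the script will block here until the last crawl call is finished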
Answer 1 (Score: 1)
I was able to implement similar functionality by removing the loop from the script and instead running a scheduler every 3 minutes.
The loop's functionality is achieved by maintaining a count of the currently running spiders and checking whether more need to be started, so that in the end at most 5 (this can be changed) spiders run at the same time. A rough sketch of this is below.
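This is only an assumption of how such a scheduler could look, since the answer doesn't show code: each spider is launched as its own scrapy crawl subprocess so every run gets a fresh reactor, and get_waiting_spiders() is a hypothetical stand-in for the database query:

import subprocess
import time

MAX_RUNNING = 5
running = []  # list of (spider_name, Popen) pairs

def get_waiting_spiders():
    # hypothetical helper: return names of spiders whose status
    # is 'waiting' in the status table
    return []

while True:
    # drop processes that have exited; here you would also mark those
    # spiders as 'finished' in the database
    running = [(name, proc) for name, proc in running if proc.poll() is None]

    # top up to MAX_RUNNING by launching waiting spiders as subprocesses,
    # so each run gets its own process (and therefore its own reactor)
    for name in get_waiting_spiders()[:MAX_RUNNING - len(running)]:
        proc = subprocess.Popen(['scrapy', 'crawl', name])
        running.append((name, proc))
        # here you would mark `name` as 'running' in the database

    time.sleep(180)  # check again every 3 minutes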
Answer 2 (Score: 0)
You need ScrapyD for this purpose.
You can run as many spiders as you like at the same time, and you can use the listjobs API to continuously check whether a spider is in the running state. You can set max_proc=5 in the config file so that at most 5 spiders run at a time.
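For example, scheduling and monitoring through ScrapyD's HTTP API (assuming scrapyd on its default port and a hypothetical project name):

import requests

SCRAPYD = 'http://localhost:6800'  # default scrapyd address (assumption)
PROJECT = 'myproject'              # hypothetical project name

# schedule a spider run
requests.post(SCRAPYD + '/schedule.json',
              data={'project': PROJECT, 'spider': 'spider1'})

# listjobs reports pending / running / finished jobs for the project
jobs = requests.get(SCRAPYD + '/listjobs.json',
                    params={'project': PROJECT}).json()
print(len(jobs['running']), 'spiders currently running')

With max_proc = 5 under the [scrapyd] section of the config file, scrapyd itself keeps at most 5 of the scheduled jobs running and queues the rest.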
Anyway, talking about your code: your code will work if you do this
process = CrawlerProcess(get_project_settings())

for i in range(10):  # this range is just for demo; instead of this i
    # find the spiders that are waiting to run from database
    process.crawl(spider1)  # spider name changes based on spider to run
    process.crawl(spider2)
    print('-------------this is the-----{}--iteration'.format(i))

process.start()
You need to put process.start() outside the loop.
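Putting that together with the database part of your question, a sketch (get_waiting_spider_names() is a hypothetical stand-in for your table query):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def get_waiting_spider_names(limit):
    # hypothetical helper: return up to `limit` spider names whose
    # status is 'waiting' in your status table
    return []

process = CrawlerProcess(get_project_settings())

for name in get_waiting_spider_names(limit=5):
    # crawl() also accepts a spider name string and resolves it
    # through the project's spider loader
    process.crawl(name)

process.start()  # the reactor starts exactly once and blocks here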