Here is the code:
from scrapy import cmdline

if __name__ == '__main__':
    cmdline.execute("scrapy crawl spider_a -L INFO".split())
    cmdline.execute("scrapy crawl spider_b -L INFO".split())
I want to run multiple spiders against the same main portal site within a Scrapy project, but it turns out that only the first spider actually runs, while the second one seems to be ignored. Any suggestions?
Answer 0: (score: 2)
cmdline.execute() never returns: it hands control to Scrapy and ends the process with sys.exit(), so the second call is never reached. Just do:
import subprocess

# Each `scrapy crawl` runs in its own process, one after another
subprocess.call('for spider in spider_a spider_b; do scrapy crawl $spider -L INFO; done', shell=True)
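If you'd rather not depend on shell syntax (the one-liner above needs a POSIX shell), here is a minimal sketch of the same idea in plain Python, using the spider names from the question:

import subprocess

# Each spider gets a fresh process, started only after the previous one exits;
# a fresh process per crawl sidesteps the sys.exit() problem entirely
for spider in ['spider_a', 'spider_b']:
    subprocess.check_call(['scrapy', 'crawl', spider, '-L', 'INFO'])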
Answer 1: (score: 0)
From the Scrapy docs: https://doc.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process
import scrapy
from scrapy.crawler import CrawlerProcess
from .spiders import Spider1, Spider2
process = CrawlerProcess()
process.crawl(Spider1)
process.crawl(Spider2)
process.start() # the script will block here until all crawling jobs are finished
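A side note, assuming this script lives inside your Scrapy project: pass the project settings to CrawlerProcess so the spiders pick up your settings.py (pipelines, middlewares, and so on):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Without this, CrawlerProcess runs with default settings only
process = CrawlerProcess(get_project_settings())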
EDIT: If you want to run multiple spiders a few at a time rather than all at once, you can do something like this:
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
configure_logging()
# Spider3 and Spider4 are placeholders for more of your project's spiders
from .spiders import Spider1, Spider2, Spider3, Spider4

spiders = [Spider1, Spider2, Spider3, Spider4]
@defer.inlineCallbacks
def join_spiders(spiders):
    """Set up a new runner and wait for the provided spiders to finish"""
    runner = CrawlerRunner()

    # Add each spider to this runner
    for spider in spiders:
        runner.crawl(spider)

    # Fires once all the spiders inside the runner have finished
    yield runner.join()
@defer.inlineCallbacks
def crawl(group_by=2):
    # Run the spiders in batches of `group_by`, one batch after another
    for i in range(0, len(spiders), group_by):
        yield join_spiders(spiders[i:i + group_by])

    # Once every batch has finished, stop the twisted reactor
    reactor.stop()

crawl()
reactor.run()  # the script will block here until the last crawl call is finished
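To make the batching concrete: with the four spiders above and the default group_by=2, the loop produces two batches, [Spider1, Spider2] and then [Spider3, Spider4]. The crawls within a batch run concurrently, and the next batch starts only once runner.join() has fired for the previous one; group_by=1 gives you strictly sequential crawls.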
I haven't tested all of this, though, so let me know if it works!