For several days now I have been struggling with a Scrapy/Twisted problem in my Main.py, which is supposed to run different spiders and analyse their output. Unfortunately, MySpider2 relies on the feed produced by MySpider1 and can therefore only run after MySpider1 has finished. In addition, MySpider1 and MySpider2 have different custom settings. So far I have not found a solution that lets me run the spiders sequentially, each with its own settings. I have looked at the Scrapy CrawlerRunner and CrawlerProcess docs and tried several related Stack Overflow questions (Run Multiple Spider sequentially, Scrapy: how to run two crawlers one after another?, Scrapy run multiple spiders from a script, etc.).
Following the documentation on running spiders sequentially, my (slightly adapted) code would be:
from MySpider1.myspider1.spiders.myspider1 import MySpider1
from MySpider2.myspider2.spiders.myspider2 import MySpider2
from twisted.internet import defer, reactor
from scrapy.crawler import CrawlerProcess
from scrapy.crawler import CrawlerRunner

spider_settings = [{
    'FEED_URI': 'abc.csv',
    'LOG_FILE': 'abc/log.log'
    # MORE settings are here
}, {
    'FEED_URI': '123.csv',
    'LOG_FILE': '123/log.log'
    # MORE settings are here
}]

spiders = [MySpider1, MySpider2]

process = CrawlerRunner(spider_settings[0])
process = CrawlerRunner(spider_settings[1])  # Not sure if this is how it's supposed to be used for
# multiple settings, but placing this line before "yield process.crawl(spiders[1])" also results in an error.

@defer.inlineCallbacks
def crawl():
    yield process.crawl(spiders[0])
    yield process.crawl(spiders[1])
    reactor.stop()

crawl()
reactor.run()
However, with this code only the first spider runs, and without any of the settings. So I tried to get further with CrawlerProcess:
from MySpider1.myspider1.spiders.myspider1 import MySpider1
from MySpider2.myspider2.spiders.myspider2 import MySpider2
from twisted.internet import defer, reactor
from scrapy.crawler import CrawlerProcess
from scrapy.crawler import CrawlerRunner

spider_settings = [{
    'FEED_URI': 'abc.csv',
    'LOG_FILE': 'abc/log.log'
    # MORE settings are here
}, {
    'FEED_URI': '123.csv',
    'LOG_FILE': '123/log.log'
    # MORE settings are here
}]

spiders = [MySpider1, MySpider2]

process = CrawlerProcess(spider_settings[0])
process = CrawlerProcess(spider_settings[1])

@defer.inlineCallbacks
def crawl():
    yield process.crawl(spiders[0])
    yield process.crawl(spiders[1])
    reactor.stop()

crawl()
reactor.run()
This code does execute both spiders, but simultaneously instead of sequentially as intended. In addition, the settings of spider[0] are overwritten by those of the second spider[1], so the log file is cut off after only two lines and logging for both spiders continues in 123/log.log.
In an ideal world, my snippet would work like this: MySpider1 runs first with spider_settings[0], and only after it has finished does MySpider2 start with spider_settings[1]. Thanks in advance for your help.
Answer (score: 1)
Use a separate runner for each settings dict and it should work:
process_1 = CrawlerRunner(spider_settings[0])
process_2 = CrawlerRunner(spider_settings[1])
#...

@defer.inlineCallbacks
def crawl():
    yield process_1.crawl(spiders[0])
    yield process_2.crawl(spiders[1])
    reactor.stop()
#...
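For completeness, a minimal end-to-end sketch of that suggestion, reusing the spider_settings and spiders lists from the question. It assumes configuring logging once via configure_logging() is acceptable; CrawlerRunner (unlike CrawlerProcess) does not set up logging or the reactor by itself, so routing each spider's output to a separate LOG_FILE may still need extra handling.

from twisted.internet import defer, reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

# One runner per settings dict, so each spider keeps its own FEED_URI etc.
runner_1 = CrawlerRunner(spider_settings[0])
runner_2 = CrawlerRunner(spider_settings[1])

configure_logging()  # CrawlerRunner does not configure logging on its own

@defer.inlineCallbacks
def crawl():
    # The second crawl only starts after the first Deferred has fired,
    # i.e. after MySpider1 has finished writing its feed.
    yield runner_1.crawl(spiders[0])
    yield runner_2.crawl(spiders[1])
    reactor.stop()

crawl()
reactor.run()

Because each crawl() call is yielded inside the inlineCallbacks function, the spiders run one after another, each with the settings of its own runner.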