I have two spiders in the same project. One of them depends on the other having run first, and they use different pipelines. How can I make sure they run sequentially?
Answer 0 (score: 2)
[spider2's URL list] -- depends on --> [spider1's results]

How about simply running spider2 after spider1 has finished successfully:

scrapy crawl Spider1 && scrapy crawl Spider2
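Since the two spiders use different pipelines, each spider can keep its own pipeline configuration when they are run as two separate crawls like this. A minimal sketch, assuming a hypothetical project module myproject and a Spider1Pipeline class (placeholder names, not part of the answer):

spider1.py

import scrapy

class Spider1(scrapy.Spider):
    name = 'Spider1'
    # enable only this spider's pipeline for this crawl;
    # 'myproject.pipelines.Spider1Pipeline' is a placeholder path
    custom_settings = {
        'ITEM_PIPELINES': {'myproject.pipelines.Spider1Pipeline': 300},
    }

    def parse(self, response):
        # your parsing code...
        pass

Spider2 would declare its own pipeline the same way, so the two commands above each run with the right pipeline.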
an individual [spider2 item] -- depends on --> a [spider1 item]

When you scrape a spider1 item, you already know which spider2 URL it points to. So how about merging the two spiders into one and passing the first item along through the request's meta attribute?
spider.py
import scrapy


class MergedSpider(scrapy.Spider):
    # name, etc..

    def first_spider_parse(self, response):
        # your code...
        item = FirstSpiderItem()

        # yield the item first, and the pipeline will handle it
        yield item

        # then issue the spider2 request; secondSpiderItemURL is the spider2 URL
        # you extracted from this response, and the first item travels along in meta
        yield scrapy.Request(secondSpiderItemURL, callback=self.second_spider_parse, dont_filter=True, meta={'firstItem': item})

    def second_spider_parse(self, response):
        item = SecondSpiderItem()
        firstItem = response.meta['firstItem']
        # your code...
        return item
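For completeness, the two item classes used above would just be ordinary Scrapy items. A minimal sketch (the field names are placeholders, not part of the answer):

items.py

import scrapy

class FirstSpiderItem(scrapy.Item):
    # placeholder fields
    url = scrapy.Field()
    title = scrapy.Field()

class SecondSpiderItem(scrapy.Item):
    # placeholder fields
    detail = scrapy.Field()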
pipelines.py
class FirstPipeline(object):
    def process_item(self, item, spider):
        # check the item type (or you can isinstance the spider instead)
        if isinstance(item, FirstSpiderItem):
            # your code
            pass
        return item


class SecondPipeline(object):
    def process_item(self, item, spider):
        if isinstance(item, SecondSpiderItem):
            # your code
            pass
        return item
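Both pipelines then have to be enabled in the project settings so every item from the merged spider passes through each of them; the isinstance checks make sure each pipeline only processes its own item type. A minimal sketch, assuming the project module is called myproject (a placeholder name):

settings.py

# 'myproject' is a placeholder; use your project's module name
ITEM_PIPELINES = {
    'myproject.pipelines.FirstPipeline': 300,
    'myproject.pipelines.SecondPipeline': 400,
}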
Answer 1 (score: 2)
Straight from the documentation: https://doc.scrapy.org/en/1.2/topics/request-response.html

The same example, but running the spiders sequentially by chaining the deferreds:
import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging


class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...


class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...


configure_logging()
runner = CrawlerRunner()


@defer.inlineCallbacks
def crawl():
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()


crawl()
reactor.run()  # the script will block here until the last crawl call is finished
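Note that CrawlerRunner() with no arguments does not load the project's settings.py, so the item pipelines from the question would not be applied. If the script lives inside the Scrapy project, you can pass the project settings in explicitly; a minimal sketch:

from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

# load settings.py (including ITEM_PIPELINES) from the enclosing Scrapy project
runner = CrawlerRunner(get_project_settings())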