Scrapy: how to run two spiders one after another?

时间:2014-12-10 19:04:42

标签: python scrapy

I have two spiders in the same project. One of them depends on the other running first, and they use different pipelines. How can I make sure they run sequentially?

2 answers:

Answer 0: (score: 2)

Solution 1

[Spider2 list] - depends on -> [Spider1 list]

How about running Spider2 only after Spider1 has finished successfully:

scrapy crawl Spider1 && scrapy crawl Spider2
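If you prefer to drive this from Python rather than the shell, a minimal sketch (the spider names are taken from the command above) could chain the two commands and only start the second crawl once the first one exits successfully:

import subprocess

# run the first spider; raise immediately if it fails
subprocess.run(['scrapy', 'crawl', 'Spider1'], check=True)
# only reached when Spider1 finished successfully
subprocess.run(['scrapy', 'crawl', 'Spider2'], check=True)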

Solution 2

[Individual Spider2 item] - depends on -> [Spider1 item]

When you scrape a Spider1 item, you already know the individual Spider2 URL to crawl.

How about merging the two spiders into one?

Pass the first item along through the request's meta attribute.

spider.py

import scrapy

# adjust this import to your own project's items module
from myproject.items import FirstSpiderItem, SecondSpiderItem


class MergedSpider(scrapy.Spider):
    # name, start_urls, etc.

    def first_spider_parse(self, response):
        # ... the first spider's extraction code ...
        item = FirstSpiderItem()
        # yield the item first, and the pipeline will handle it
        yield item
        # then issue the request the second spider would have made,
        # passing the first item along in the request meta
        yield scrapy.Request(secondSpiderItemURL, callback=self.second_spider_parse,
                             dont_filter=True, meta={'firstItem': item})

    def second_spider_parse(self, response):
        # the first item is available again through the response meta
        firstItem = response.meta['firstItem']
        item = SecondSpiderItem()
        # ... populate item, possibly reusing fields from firstItem ...
        return item

pipelines.py

class FirstPipeline(object):
    def process_item(self, item, spider):
        # alternatively, check the spider instance or spider.name here
        if isinstance(item, FirstSpiderItem):
            # your code
            pass
        return item


class SecondPipeline(object):
    def process_item(self, item, spider):
        if isinstance(item, SecondSpiderItem):
            # your code
            pass
        return item
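Whichever way the spiders are combined, both pipelines have to be enabled in the project settings. A minimal sketch of the corresponding settings.py entry (the module path and priority numbers are assumptions; adjust them to your project):

settings.py

ITEM_PIPELINES = {
    # module path is assumed; point it at your own pipelines module
    'myproject.pipelines.FirstPipeline': 300,
    'myproject.pipelines.SecondPipeline': 400,
}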

Answer 1: (score: 2)

Taken straight from the documentation: https://doc.scrapy.org/en/1.2/topics/request-response.html

The same example, but running the spiders sequentially by chaining the deferreds:

import scrapy

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()

crawl()
reactor.run() # the script will block here until the last crawl call is finished
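Note that a CrawlerRunner() created without arguments does not pick up the project's settings, so the two different pipelines from the question would not run. A minimal sketch, assuming the script is executed from inside the Scrapy project, that loads them explicitly:

from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

# pass the project settings so ITEM_PIPELINES and the other project options apply
runner = CrawlerRunner(get_project_settings())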