Chaining Scrapy spiders with data dependencies in the Twisted reactor

Time: 2018-02-15 11:21:04

Tags: python asynchronous scrapy twisted yield

The Scrapy documentation explains how to chain two spiders like this:

import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
runner = CrawlerRunner()


@defer.inlineCallbacks
def crawl():
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()

crawl()
reactor.run() # the script will block here until the last crawl call is finished

But in my use case, MySpider2 needs to use the information retrieved by MySpider1 after it has been transformed by transformFunction().

So I want something like this:

def transformFunction():
    ... transform the data retrieved by MySpider1 ...
    return newdata

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(MySpider1)
    newdata = transformFunction()
    yield runner.crawl(MySpider2, data=newdata)
    reactor.stop()

What I want to arrange:

  1. MySpider1 starts, writes data to disk, then exits
  2. transformFunction() transforms data into newdata
  3. MySpider2 starts and uses newdata

How can I manage this behavior with the Twisted reactor and Scrapy?
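Since step 1 writes the data to disk, transformFunction() could be as simple as reading that file back and reshaping it. A dependency-free sketch (the file path and the "url" field are hypothetical, just to make the steps concrete):

```python
import json

def transformFunction(path="items.json"):
    # Read the items MySpider1 wrote to disk
    # (the path and the "url" field are hypothetical)
    with open(path) as f:
        items = json.load(f)
    # Example transformation: keep only the URLs for MySpider2
    return [item["url"] for item in items]
```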

1 Answer:

Answer 0 (score: 1):

runner.crawl returns a Deferred, so you can chain callbacks onto it. Your code only needs a few small adjustments.
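If you have not used Deferreds before, the key property the solution below relies on is that each callback receives the return value of the previous one. A toy, synchronous stand-in (MiniDeferred is not the real Twisted API, just an illustration of how results are threaded through the chain):

```python
class MiniDeferred:
    """Toy, synchronous stand-in for Twisted's Deferred: each callback
    receives the return value of the previous one (not the real API)."""
    def __init__(self, result=None):
        self.result = result

    def addCallback(self, fn, *args):
        # Twisted passes extra addCallback args after the result
        self.result = fn(self.result, *args)
        return self

d = MiniDeferred()
d.addCallback(lambda _: "data")                         # plays the role of runner.crawl(MySpider1)
d.addCallback(lambda r: r.upper())                      # plays the role of transformFunction
d.addCallback(lambda r, tag: tag + r, "spider2 got: ")  # plays the role of crawl2
# d.result is now "spider2 got: DATA"
```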

from twisted.internet import task
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()

def crawl(reactor):
    runner = CrawlerRunner()
    d = runner.crawl(MySpider1)
    d.addCallback(transformFunction)
    d.addCallback(crawl2, runner)
    return d

def transformFunction(result):
    # crawl doesn't usually return any results if successful so ignore result var here
    # ...
    return newdata

def crawl2(result, runner):
    # result == newdata from transformFunction
    # runner is passed in from crawl()
    return runner.crawl(MySpider2, data=result)

task.react(crawl)

The main function is crawl(), which is executed by task.react(); task.react() starts and stops the reactor for you. A Deferred is returned from runner.crawl(), and the transformFunction and crawl2 functions are chained onto it, so that when one function finishes, the next one starts.
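On the Scrapy side, the keyword argument passed in runner.crawl(MySpider2, data=result) is forwarded to the spider's __init__, and scrapy.Spider's default __init__ copies keyword arguments onto the instance as attributes. A dependency-free sketch of that mechanism (SpiderSketch stands in for scrapy.Spider; the attribute and method names are hypothetical):

```python
class SpiderSketch:
    # Mimics scrapy.Spider's default __init__, which copies
    # keyword arguments onto the instance as attributes
    def __init__(self, name=None, **kwargs):
        self.name = name
        self.__dict__.update(kwargs)

class MySpider2Sketch(SpiderSketch):
    def urls_from_data(self):
        # In a real spider, self.data would drive start_requests()
        return list(self.data)

spider = MySpider2Sketch(name="myspider2", data=["https://example.com/a"])
# spider.data is now available everywhere in the spider
```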