I have some code that looks like this:
from scrapy.crawler import CrawlerProcess

def run(spider_name, settings):
    runner = CrawlerProcess(settings)
    runner.crawl(spider_name)
    runner.start()
    return True
I have two py.test tests, each of which calls run(). When the second test executes, I get the following error:
        runner.start()
../../.virtualenvs/scrape-service/lib/python3.6/site-packages/scrapy/crawler.py:291: in start
    reactor.run(installSignalHandlers=False)  # blocking call
../../.virtualenvs/scrape-service/lib/python3.6/site-packages/twisted/internet/base.py:1242: in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
../../.virtualenvs/scrape-service/lib/python3.6/site-packages/twisted/internet/base.py:1222: in startRunning
    ReactorBase.startRunning(self)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <twisted.internet.selectreactor.SelectReactor object at 0x10fe21588>

    def startRunning(self):
        """
        Method called when reactor starts: do some initialization and fire
        startup events.

        Don't call this directly, call reactor.run() instead: it should take
        care of calling this.

        This method is somewhat misnamed. The reactor will not necessarily be
        in the running state by the time this method returns. The only
        guarantee is that it will be on its way to the running state.
        """
        if self._started:
            raise error.ReactorAlreadyRunning()
        if self._startedBefore:
>           raise error.ReactorNotRestartable()
E           twisted.internet.error.ReactorNotRestartable
I get that this reactor thing is already running, so I can't call runner.start() when the second test runs. But is there a way to reset its state between tests, so that they are more isolated and can actually run one after another?
Answer 0 (score: 0)
By default, Scrapy runs a single spider per process when you run scrapy crawl. However, Scrapy supports running multiple spiders per process using the internal API.

For example:
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()  # the script will block here until all crawling jobs are finished

If you want to run another spider after the call to process.start(), then I expect you could issue another process.start() call at the point in your program where you determine it is needed.
Examples for other scenarios are in the docs.
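One of those documented scenarios, sketched here from memory of the Scrapy docs (so treat it as an approximation rather than a verbatim copy), runs several crawls sequentially with CrawlerRunner and chained Deferreds instead of calling process.start() more than once:

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    # each yield waits for the previous crawl to finish before starting the next
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()

crawl()
reactor.run()  # the script will block here until the last crawl call is finished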
Answer 1 (score: 0)
If you use CrawlerRunner instead of CrawlerProcess, in combination with pytest-twisted, you should be able to run your tests like this:

Install the Twisted integration for pytest: pip install pytest-twisted
from scrapy.crawler import CrawlerRunner

def _run_crawler(spider_cls, settings):
    """
    spider_cls: Scrapy Spider class
    settings: Scrapy settings
    returns: Twisted Deferred
    """
    runner = CrawlerRunner(settings)
    return runner.crawl(spider_cls)  # return Deferred

def test_scrapy_crawler():
    deferred = _run_crawler(MySpider, settings)

    @deferred.addCallback
    def _success(results):
        """
        After the crawler completes, this function will execute.
        Do your assertions in this function.
        """

    @deferred.addErrback
    def _error(failure):
        raise failure.value

    return deferred
Put plainly, _run_crawler() schedules the crawl in the Twisted reactor and fires the callbacks when the scrape completes. In those callbacks (_success() and _error()) you perform your assertions. Finally, you must return the Deferred object from _run_crawler() so that the test waits for the crawl to finish. Returning the Deferred is essential and must be done for every test.
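As an aside that is not part of the original answer: pytest-twisted also ships an inlineCallbacks decorator, so the same test can arguably be written in a generator style instead of attaching callbacks by hand. A rough sketch, assuming the _run_crawler() helper above and that MySpider and settings are defined elsewhere:

import pytest_twisted

@pytest_twisted.inlineCallbacks
def test_scrapy_crawler_inline():
    # waits for the Deferred returned by _run_crawler() to fire
    yield _run_crawler(MySpider, settings)
    # do your assertions here, after the crawl has finished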
Below is an example of running multiple crawls and aggregating the results using gatherResults.
from twisted.internet import defer

def test_multiple_crawls():
    d1 = _run_crawler(Spider1, settings)
    d2 = _run_crawler(Spider2, settings)
    d_list = defer.gatherResults([d1, d2])

    @d_list.addCallback
    def _success(results):
        assert True

    @d_list.addErrback
    def _error(failure):
        assert False

    return d_list
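The snippets above reference MySpider, Spider1, Spider2, and settings without defining them; here is a minimal, purely hypothetical sketch of what those names could stand for:

import scrapy

class MySpider(scrapy.Spider):
    # hypothetical spider; any Spider subclass works in the tests above
    name = "my_spider"
    start_urls = ["https://example.com"]

    def parse(self, response):
        yield {"title": response.css("title::text").get()}

# Spider1 / Spider2 would be defined the same way
settings = {"LOG_LEVEL": "WARNING"}  # any Scrapy settings dict or Settings object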
I hope this helps, and if it doesn't, ask about where you're struggling.