How to run a Scrapy spider multiple times from Tornado requests

Date: 2015-09-03 09:32:21

Tags: python scrapy tornado

I need to run a Scrapy spider from a Tornado `get` request handler. The first time I hit the Tornado endpoint the spider runs fine, but on every subsequent request the spider does not run and the following error is raised:

Traceback (most recent call last):
    File "/Users/Sebastian/anaconda/lib/python2.7/site-packages/tornado/web.py", line 1413, in _execute
        result = method(*self.path_args, **self.path_kwargs)
    File "server.py", line 38, in get
        process.start()
    File "/Users/Sebastian/anaconda/lib/python2.7/site-packages/scrapy/crawler.py", line 251, in start
        reactor.run(installSignalHandlers=False)  # blocking call
    File "/Users/Sebastian/anaconda/lib/python2.7/site-packages/twisted/internet/base.py", line 1193, in run
        self.startRunning(installSignalHandlers=installSignalHandlers)
    File "/Users/Sebastian/anaconda/lib/python2.7/site-packages/twisted/internet/base.py", line 1173, in startRunning
        ReactorBase.startRunning(self)
    File "/Users/Sebastian/anaconda/lib/python2.7/site-packages/twisted/internet/base.py", line 684, in startRunning
        raise error.ReactorNotRestartable()
ReactorNotRestartable
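
The failure is rooted in Twisted rather than in Scrapy or Tornado: the Twisted reactor is a process-wide singleton that can be started at most once. A minimal reproduction, independent of both frameworks:

from twisted.internet import reactor

# Schedule an immediate stop so the first run() returns.
reactor.callWhenRunning(reactor.stop)
reactor.run()  # first start: works

reactor.run()  # second start: raises error.ReactorNotRestartable

As the traceback shows, CrawlerProcess.start() calls reactor.run() internally, so the second HTTP request triggers exactly this.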

The Tornado handler is:

import json

import tornado.web
from scrapy.crawler import CrawlerProcess


class PageHandler(tornado.web.RequestHandler):

    def get(self):
        # A fresh CrawlerProcess is built on every request...
        process = CrawlerProcess({
            'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
            'ITEM_PIPELINES': {'__main__.ResultsPipeline': 1}
        })

        process.crawl(YourSpider)
        # ...but start() runs the reactor, which can only be started once
        # per process; this is the line that raises on the second request.
        process.start()

        # `results` is populated by ResultsPipeline during the crawl.
        self.write(json.dumps(results))

So the idea is that the DirectoryHandler method can be called over and over, with the spider running and performing the crawl on each request.
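
For reference, ResultsPipeline and the results variable are used above but never shown in the post; a minimal sketch consistent with that usage (the pipeline body and the module-level list are assumptions, not the original code) could be:

# Assumed implementation of the pipeline referenced as '__main__.ResultsPipeline'.
results = []  # module-level list that the handler serializes to JSON


class ResultsPipeline(object):
    """Collect scraped items so the request handler can return them."""

    def process_item(self, item, spider):
        results.append(dict(item))
        return item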

1 Answer:

Answer 0 (score: 1)

After a long time googling, I finally found an answer to this problem... There is a library, scrapydo (https://github.com/darkrho/scrapydo), built on crochet: it keeps the Twisted reactor running in a background thread and exposes crawls as blocking calls, so the same spider can be run again on every request.

So to fix the problem you need to install the library, call its setup method once, and then use the run_spider method... The code looks like this:

import json

import scrapydo
import tornado.web

# Initialize scrapydo (and its background reactor) exactly once, at import time.
scrapydo.setup()


class PageHandler(tornado.web.RequestHandler):

    def get(self):
        # Blocking call; safe to repeat, because the reactor keeps running
        # in its background thread between requests.
        scrapydo.run_spider(YourSpider(), settings={
            'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
            'ITEM_PIPELINES': {'__main__.ResultsPipeline': 1}
        })

        # `results` is filled by ResultsPipeline, as in the question.
        self.write(json.dumps(results))
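
A possible simplification, assuming scrapydo's documented behavior that run_spider returns the scraped items: the shared results list and the pipeline could be dropped, and the handler body reduced to something like:

# Sketch: run_spider is documented in the scrapydo README to return
# the scraped items, so the handler body could instead be:
items = scrapydo.run_spider(YourSpider(), settings={
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
})
self.write(json.dumps(items))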

Hope this helps anyone who runs into the same problem!