I need to run a Scrapy spider from the get method of a Tornado request handler. The first time I hit the Tornado endpoint the spider runs fine, but when I make a second request to Tornado the spider does not run and the following error is raised:
Traceback (most recent call last):
  File "/Users/Sebastian/anaconda/lib/python2.7/site-packages/tornado/web.py", line 1413, in _execute
    result = method(*self.path_args, **self.path_kwargs)
  File "server.py", line 38, in get
    process.start()
  File "/Users/Sebastian/anaconda/lib/python2.7/site-packages/scrapy/crawler.py", line 251, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/Users/Sebastian/anaconda/lib/python2.7/site-packages/twisted/internet/base.py", line 1193, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/Users/Sebastian/anaconda/lib/python2.7/site-packages/twisted/internet/base.py", line 1173, in startRunning
    ReactorBase.startRunning(self)
  File "/Users/Sebastian/anaconda/lib/python2.7/site-packages/twisted/internet/base.py", line 684, in startRunning
    raise error.ReactorNotRestartable()
ReactorNotRestartable
The Tornado handler is:
import json

import tornado.web
from scrapy.crawler import CrawlerProcess


class PageHandler(tornado.web.RequestHandler):
    def get(self):
        process = CrawlerProcess({
            'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
            'ITEM_PIPELINES': {'__main__.ResultsPipeline': 1},
        })
        process.crawl(YourSpider)
        process.start()  # blocks: runs the Twisted reactor until the crawl finishes
        self.write(json.dumps(results))  # results is filled by ResultsPipeline elsewhere
So the idea is that the DirectoryHandler method can be called over and over, with the spider running and performing the crawl on each request.
Answer 0 (score: 1)
After googling for a long time I finally found the answer to this problem... There is a library, scrapydo (https://github.com/darkrho/scrapydo), which is based on crochet: it keeps a single Twisted reactor running and exposes blocking calls on top of it, so the same process can run a spider again and again. (The error above happens because CrawlerProcess.start() calls reactor.run(), and a Twisted reactor cannot be restarted once it has stopped, so the second request fails with ReactorNotRestartable.)
So to solve the problem you need to install the library, call the setup method once, and then use the run_spider method... the code looks like this:
import json

import scrapydo
import tornado.web

scrapydo.setup()  # call once at startup; crochet starts the reactor in a background thread


class PageHandler(tornado.web.RequestHandler):
    def get(self):
        scrapydo.run_spider(YourSpider(), settings={
            'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
            'ITEM_PIPELINES': {'__main__.ResultsPipeline': 1},
        })
        self.write(json.dumps(results))  # results is filled by ResultsPipeline elsewhere
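For anyone wiring this into a complete server, here is a minimal sketch of how the handler might be mounted and run. The /crawl route and port 8888 are arbitrary choices for illustration, not part of the original answer; PageHandler is the class defined above.

import tornado.ioloop
import tornado.web

def make_app():
    # a single route that triggers a crawl on every GET request
    return tornado.web.Application([(r"/crawl", PageHandler)])

if __name__ == "__main__":
    make_app().listen(8888)
    tornado.ioloop.IOLoop.current().start()

Note that scrapydo.run_spider is a blocking call, so each GET holds the Tornado IOLoop until the crawl finishes; that is fine for a single-user tool but worth keeping in mind under concurrent load.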
Hope this helps anyone who runs into the same problem!