Scrapy - Reactor not Restartable

Date: 2017-01-05 21:32:14

Tags: python scrapy web-crawler

Using:

from twisted.internet import reactor
from scrapy.crawler import CrawlerProcess

I have always run this process successfully:

process = CrawlerProcess(get_project_settings())
process.crawl(*args)
# the script will block here until the crawling is finished
process.start() 

But since I have moved this code into a web_crawler(self) function, like so:

def web_crawler(self):
    # set up a crawler
    process = CrawlerProcess(get_project_settings())
    process.crawl(*args)
    # the script will block here until the crawling is finished
    process.start() 

    # (...)

    return (result1, result2) 

and started calling the method through class instantiation, like:

def __call__(self):
    results1 = test.web_crawler()[1]
    results2 = test.web_crawler()[0]

and running:

test()

I get the following error:

Traceback (most recent call last):
  File "test.py", line 573, in <module>
    print (test())
  File "test.py", line 530, in __call__
    artists = test.web_crawler()
  File "test.py", line 438, in web_crawler
    process.start() 
  File "/Library/Python/2.7/site-packages/scrapy/crawler.py", line 280, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 1194, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 1174, in startRunning
    ReactorBase.startRunning(self)
  File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 684, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

What is going wrong?

6 answers:

Answer 0 (score: 15):

You cannot restart the reactor, but you can run it multiple times by forking a separate process; each child process gets its own fresh reactor:

import scrapy
import scrapy.crawler as crawler
from multiprocessing import Process, Queue
from twisted.internet import reactor

# your spider
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            print(quote.css('span.text::text').extract_first())


# the wrapper to make it run more times
def run_spider(spider):
    def f(q):
        try:
            runner = crawler.CrawlerRunner()
            deferred = runner.crawl(spider)
            deferred.addBoth(lambda _: reactor.stop())
            reactor.run()
            q.put(None)
        except Exception as e:
            q.put(e)

    q = Queue()
    p = Process(target=f, args=(q,))
    p.start()
    result = q.get()
    p.join()

    if result is not None:
        raise result

Run it twice:

print('first run:')
run_spider(QuotesSpider)

print('\nsecond run:')
run_spider(QuotesSpider)

Result:

first run:
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“A day without sunshine is like, you know, night.”
...

second run:
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“A day without sunshine is like, you know, night.”
...

Answer 1 (score: 9):

This is what helped me win the battle against the ReactorNotRestartable error: last answer from the author of the question

0) pip install crochet
1) from crochet import setup
2) setup() - at the top of the file
3) remove these 2 lines:
a) d.addBoth(lambda _: reactor.stop())
b) reactor.run()

I ran into the same error, spent 4+ hours solving it, and read every question here about it. Finally found a solution - and I'm sharing it. This is how I solved the problem. The only meaningful lines from the Scrapy docs left in my code are the last 2 lines:

from importlib import import_module

from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

from crochet import setup
setup()

def run_spider(spiderName):
    module_name="first_scrapy.spiders.{}".format(spiderName)
    scrapy_var = import_module(module_name)   #do some dynamic import of selected spider   
    spiderObj=scrapy_var.mySpider()           #get mySpider-object from spider module
    crawler = CrawlerRunner(get_project_settings())   #from Scrapy docs
    crawler.crawl(spiderObj)                          #from Scrapy docs

This code lets me select which spider to run just by passing its name to the run_spider function, and once scraping is done, select another spider and run it again. Hope this helps somebody, as it helped me :).
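Note that with crochet, the reactor runs in a background thread, so run_spider as written returns immediately rather than blocking. If each crawl needs to finish before the next one starts, a minimal sketch using crochet's wait_for decorator (assuming the same hypothetical first_scrapy project layout as above) could look like this:

from importlib import import_module

from crochet import setup, wait_for
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

setup()

@wait_for(timeout=600.0)  # block the calling thread until the crawl's Deferred fires
def run_spider_blocking(spiderName):
    module_name = "first_scrapy.spiders.{}".format(spiderName)  # hypothetical layout
    scrapy_var = import_module(module_name)
    crawler = CrawlerRunner(get_project_settings())
    return crawler.crawl(scrapy_var.mySpider())  # wait_for waits on this Deferred

run_spider_blocking('spider_one')  # hypothetical spider module names
run_spider_blocking('spider_two')  # safe to call again; the reactor keeps running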

Answer 2 (score: 1):

According to the Scrapy documentation, the start() method of the CrawlerProcess class does the following:

"[...] starts a Twisted reactor, adjusts its pool size to REACTOR_THREADPOOL_MAXSIZE, and installs a DNS cache based on DNSCACHE_ENABLED and DNSCACHE_SIZE."

The error you are receiving is thrown by Twisted, because the Twisted reactor cannot be restarted. It uses a ton of globals, and even if you do rig up some sort of code to restart it (I have seen it done), there is no guarantee it will work.

Honestly, if you think you need to restart the reactor, you are likely doing something wrong.

Depending on your requirements, I would also review the Running Scrapy from a Script section of the documentation.
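For reference, the pattern from that section of the docs schedules every spider up front and starts the reactor exactly once (MySpider and OtherSpider below are placeholders for your own spider classes):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl(MySpider)     # schedule as many spiders as needed
process.crawl(OtherSpider)  # before starting the reactor
process.start()             # starts the reactor once; blocks until all crawls finish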

Answer 3 (score: 1):

The mistake is in this code:

def __call__(self):
    result1 = test.web_crawler()[1]
    result2 = test.web_crawler()[0] # here

web_crawler() returns two results, and in order to get them the code tries to start the process twice, restarting the Reactor, as pointed out by @Rejected.

Running one single process and storing both results in a tuple is the way to go here:

def __call__(self):
    result1, result2 = test.web_crawler()

Answer 4 (score: 0):

This solved my problem. Put the code below after reactor.run() or process.start(); os.execl replaces the current process with a fresh copy of the script, so the next run gets a brand-new reactor:

import os
import sys
import time

time.sleep(0.5)
os.execl(sys.executable, sys.executable, *sys.argv)  # re-exec this script in place

Answer 5 (score: 0):

As some people already pointed out: you shouldn't need to restart the reactor.

Ideally, if you want to chain your processes (crawl1, then crawl2, then crawl3), you simply add callbacks, as in the sketch below.
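A minimal sketch of that chaining idea; Crawl1Spider, Crawl2Spider, and Crawl3Spider are hypothetical placeholders for your own spider classes. Returning the next crawl's Deferred from a callback makes the chain wait for it to finish:

from scrapy.crawler import CrawlerRunner
from twisted.internet import reactor

runner = CrawlerRunner()
d = runner.crawl(Crawl1Spider)                       # start the first crawl
d.addCallback(lambda _: runner.crawl(Crawl2Spider))  # then the second
d.addCallback(lambda _: runner.crawl(Crawl3Spider))  # then the third
d.addBoth(lambda _: reactor.stop())                  # stop once the chain ends
reactor.run()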

For example, I have been using a loop spider that follows this pattern:

1. Crawl A
2. Sleep N
3. goto 1

And this is how it looks in Scrapy:

import time

import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor

class HttpbinSpider(scrapy.Spider):
    name = 'httpbin'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/ip']

    def parse(self, response):
        print(response.body)

def sleep(_, duration=5):
    print(f'sleeping for: {duration}')
    time.sleep(duration)  # block here


def crawl(runner):
    d = runner.crawl(HttpbinSpider)
    d.addBoth(sleep)
    d.addBoth(lambda _: crawl(runner))
    return d


def loop_crawl():
    runner = CrawlerRunner(get_project_settings())
    crawl(runner)
    reactor.run()


if __name__ == '__main__':
    loop_crawl()

To explain the process further: the crawl function schedules a crawl and adds two callbacks to be invoked when crawling is over: a blocking sleep and a recursive call to itself (which schedules another crawl).

$ python endless_crawl.py 
b'{\n  "origin": "000.000.000.000"\n}\n'
sleeping for: 5
b'{\n  "origin": "000.000.000.000"\n}\n'
sleeping for: 5
b'{\n  "origin": "000.000.000.000"\n}\n'
sleeping for: 5
b'{\n  "origin": "000.000.000.000"\n}\n'
sleeping for: 5
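
One caveat with the pattern above: time.sleep blocks the reactor thread while it waits. A non-blocking variant of the sleep-then-recurse step, sketched with Twisted's standard task.deferLater utility, could look like this:

from twisted.internet import reactor, task

def crawl(runner):
    d = runner.crawl(HttpbinSpider)
    # fire after 5 seconds without blocking the reactor
    d.addBoth(lambda _: task.deferLater(reactor, 5, lambda: None))
    d.addBoth(lambda _: crawl(runner))
    return d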