Using:
from twisted.internet import reactor
from scrapy.crawler import CrawlerProcess
I have always run this process successfully with:
process = CrawlerProcess(get_project_settings())
process.crawl(*args)
# the script will block here until the crawling is finished
process.start()
But since I've moved this code into a web_crawler(self) function, like so:
def web_crawler(self):
    # set up a crawler
    process = CrawlerProcess(get_project_settings())
    process.crawl(*args)
    # the script will block here until the crawling is finished
    process.start()

    # (...)

    return (result1, result2)
and started calling the method through class instantiation, like:
def __call__(self):
    results1 = test.web_crawler()[1]
    results2 = test.web_crawler()[0]
and running:
test()
I am getting the following error:
Traceback (most recent call last):
  File "test.py", line 573, in <module>
    print (test())
  File "test.py", line 530, in __call__
    artists = test.web_crawler()
  File "test.py", line 438, in web_crawler
    process.start()
  File "/Library/Python/2.7/site-packages/scrapy/crawler.py", line 280, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 1194, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 1174, in startRunning
    ReactorBase.startRunning(self)
  File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 684, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable
What is going wrong?
Answer 0 (score: 15)
You won't be able to restart the reactor, but you can run it multiple times by forking a separate process:
import scrapy
import scrapy.crawler as crawler
from multiprocessing import Process, Queue
from twisted.internet import reactor


# your spider
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            print(quote.css('span.text::text').extract_first())


# the wrapper to make it run more times
def run_spider(spider):
    def f(q):
        try:
            runner = crawler.CrawlerRunner()
            deferred = runner.crawl(spider)
            deferred.addBoth(lambda _: reactor.stop())
            reactor.run()
            q.put(None)
        except Exception as e:
            q.put(e)

    q = Queue()
    p = Process(target=f, args=(q,))
    p.start()
    result = q.get()
    p.join()

    if result is not None:
        raise result
Run it twice:
print('first run:')
run_spider(QuotesSpider)
print('\nsecond run:')
run_spider(QuotesSpider)
Result:
first run:
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“A day without sunshine is like, you know, night.”
...
second run:
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“A day without sunshine is like, you know, night.”
...
Answer 1 (score: 9)
This is what helped me win the fight against the ReactorNotRestartable error: last answer from the author of the question
0) pip install crochet
1) import: from crochet import setup
2) setup() - at the top of the file
3) remove 2 lines:
   a) d.addBoth(lambda _: reactor.stop())
   b) reactor.run()
I had the same problem with this error, spent 4+ hours solving it, and read all the questions here about it. Finally found one that works - and am sharing it. This is how I solved it. The only meaningful lines left from the Scrapy docs are the last 2 lines in my code:
#some more imports
from crochet import setup
setup()

def run_spider(spiderName):
    module_name = "first_scrapy.spiders.{}".format(spiderName)
    scrapy_var = import_module(module_name)    #do some dynamic import of selected spider
    spiderObj = scrapy_var.mySpider()          #get mySpider-object from spider module
    crawler = CrawlerRunner(get_project_settings())  #from Scrapy docs
    crawler.crawl(spiderObj)                         #from Scrapy docs
This code allows me to select which spider to run just by passing its name to the run_spider function, and after the scraping finishes, to select another spider and run it again.
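A minimal usage sketch under these assumptions (the spider names below are hypothetical placeholders for modules in first_scrapy.spiders; with crochet's setup() installed, each call schedules a crawl on the background reactor rather than blocking):

run_spider("spider_one")   # hypothetical spider name
# ... later, once the first scrape is done
run_spider("spider_two")   # hypothetical spider name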
Hope this helps somebody, as it helped me :)
Answer 2 (score: 1)
As per the Scrapy documentation, the start() method of the CrawlerProcess class does the following:
"[...] starts a Twisted reactor, adjusts its pool size to REACTOR_THREADPOOL_MAXSIZE, and installs a DNS cache based on DNSCACHE_ENABLED and DNSCACHE_SIZE."
The error you are receiving is being thrown by Twisted, because a Twisted reactor cannot be restarted. It uses a lot of globals, and even if you do rig up some sort of code to restart it (I've seen it done), there's no guarantee it will work.
Honestly, if you think you need to restart the reactor, you're likely doing something wrong.
Depending on what you want to do, I would also review the Running Scrapy from a Script portion of the documentation.
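For reference, the single-run pattern from that Running Scrapy from a Script section looks roughly like this (MySpider stands in for your own spider class):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl(MySpider)  # MySpider is a placeholder for your spider class
process.start()          # starts the reactor once and blocks until crawling is finished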
Answer 3 (score: 1)
The mistake is in this code:
def __call__(self):
    result1 = test.web_crawler()[1]
    result2 = test.web_crawler()[0] # here
web_crawler() returns two results, and in order to get them this way it tries to start the process twice, restarting the reactor, as pointed out by @Rejected.
Running a single process and storing both results in a tuple is the way to go here:
def __call__(self):
    result1, result2 = test.web_crawler()
Answer 4 (score: 0)
This solved my problem. Put the code below after reactor.run() or process.start():
time.sleep(0.5)
os.execl(sys.executable, sys.executable, *sys.argv)
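For context, a minimal sketch of where those two lines land relative to the blocking call (MySpider is a placeholder; the re-exec simply restarts the whole script so the next run gets a fresh reactor):

import os
import sys
import time

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl(MySpider)  # MySpider stands in for your spider class
process.start()          # blocks here until the crawl finishes

time.sleep(0.5)
os.execl(sys.executable, sys.executable, *sys.argv)  # replace the current process with a fresh run of this script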
Answer 5 (score: 0)
As some people have already pointed out: you shouldn't need to restart the reactor.
Ideally, if you want to chain your processes (crawl1 then crawl2 then crawl3), you simply add callbacks (a chaining sketch follows at the end of this answer).
For example, I've been using this loop spider that follows this pattern:
1. Crawl A
2. Sleep N
3. goto 1
And this is how it looks in scrapy:
import time

import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor


class HttpbinSpider(scrapy.Spider):
    name = 'httpbin'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/ip']

    def parse(self, response):
        print(response.body)


def sleep(_, duration=5):
    print(f'sleeping for: {duration}')
    time.sleep(duration)  # block here


def crawl(runner):
    d = runner.crawl(HttpbinSpider)
    d.addBoth(sleep)
    d.addBoth(lambda _: crawl(runner))
    return d


def loop_crawl():
    runner = CrawlerRunner(get_project_settings())
    crawl(runner)
    reactor.run()


if __name__ == '__main__':
    loop_crawl()
To explain the process further: the crawl function schedules a crawl and adds two callbacks that are called when the crawl finishes: a blocking sleep and a recursive call to itself (which schedules another crawl).
$ python endless_crawl.py
b'{\n "origin": "000.000.000.000"\n}\n'
sleeping for: 5
b'{\n "origin": "000.000.000.000"\n}\n'
sleeping for: 5
b'{\n "origin": "000.000.000.000"\n}\n'
sleeping for: 5
b'{\n "origin": "000.000.000.000"\n}\n'
sleeping for: 5
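For the crawl1-then-crawl2 chaining mentioned above, a hedged sketch of the callback approach (SpiderOne and SpiderTwo are placeholders for your own spider classes) could look like:

def chain_crawls():
    runner = CrawlerRunner(get_project_settings())
    d = runner.crawl(SpiderOne)                       # crawl1
    d.addCallback(lambda _: runner.crawl(SpiderTwo))  # crawl2 runs after crawl1 finishes
    d.addBoth(lambda _: reactor.stop())               # stop the reactor once the chain is done
    return d

chain_crawls()
reactor.run()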