How to get a stats value after CrawlerProcess has finished (i.e., on the line after process.start())

Date: 2018-10-23 03:23:47

Tags: python scrapy statistics web-crawler

I am using this code somewhere inside my spider:

raise scrapy.exceptions.CloseSpider('you_need_to_rerun')

So when this exception is raised, my spider eventually finishes closing, and the stats printed to the console contain this string:

'finish_reason': 'you_need_to_rerun',
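(That value is written into the stats by Scrapy's built-in CoreStats extension when the spider closes, so it lands in the same stats object as the rest of the crawl summary.)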

But how do I get that value from code? I want to rerun the spider in a loop based on this stat, like this:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
import spaida.spiders.spaida_spider
import spaida.settings


you_need_to_rerun = True
while you_need_to_rerun:
    process = CrawlerProcess(get_project_settings())
    process.crawl(spaida.spiders.spaida_spider.SpaidaSpiderSpider)
    process.start(stop_after_crawl=False)  # the script will block here until the crawling is finished
    finish_reason = 'and here I get somehow finish_reason from stats' # <- how??
    if finish_reason == 'finished':
        print("everything ok, I don't need to rerun this")
        you_need_to_rerun = False

I found this in the documentation, but I can't quite work it out; it says "this stats can be accessed through the spider_stats attribute, which is a dict keyed by spider domain name": https://doc.scrapy.org/en/latest/topics/stats.html#scrapy.statscollectors.MemoryStatsCollector.spider_stats
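For reference, with the default MemoryStatsCollector that attribute is simply a dict mapping each closed spider's name to its stats dump. A minimal sketch, assuming a Crawler reference obtained as in the answer below (the spider name 'spaida_spider' here is hypothetical):

# Assuming the default MemoryStatsCollector and a reference to the Crawler:
per_spider = crawler.stats.spider_stats  # {spider_name: stats_dict}
reason = per_spider['spaida_spider']['finish_reason']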

PS: I also get a twisted.internet.error.ReactorNotRestartable error when using process.start(), along with the suggestion to use process.start(stop_after_crawl=False) instead; but then the spider just stops and does nothing. That is really a separate question (a workaround is sketched after the answer below)...

1 Answer:

Answer 0 (score: 0)

You need to access the stats object through the Crawler object, keeping your own reference to the crawler before starting the process:

process = CrawlerProcess(get_project_settings())
# Keep a reference: process.crawlers is a set, emptied as crawls finish.
crawler = process.create_crawler(spaida.spiders.spaida_spider.SpaidaSpiderSpider)
process.crawl(crawler)
process.start()
reason = crawler.stats.get_value('finish_reason')
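This works because the Crawler (and its stats collector) outlives the crawl, whereas process.crawlers has already been emptied by the time process.start() returns, so indexing into it afterwards cannot work.

As for the PS above: the Twisted reactor cannot be restarted within one Python process, so the while loop from the question cannot simply call process.start() again. One commonly suggested workaround (not part of the original answer; module and spider names are taken from the question) is to run each crawl in its own process and ship the finish_reason back through a queue:

import multiprocessing

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

import spaida.spiders.spaida_spider


def run_spider(queue):
    # A fresh process means a fresh Twisted reactor, so
    # ReactorNotRestartable cannot occur across reruns.
    process = CrawlerProcess(get_project_settings())
    crawler = process.create_crawler(spaida.spiders.spaida_spider.SpaidaSpiderSpider)
    process.crawl(crawler)
    process.start()
    queue.put(crawler.stats.get_value('finish_reason'))


if __name__ == '__main__':
    finish_reason = None
    while finish_reason != 'finished':
        queue = multiprocessing.Queue()
        worker = multiprocessing.Process(target=run_spider, args=(queue,))
        worker.start()
        worker.join()
        finish_reason = queue.get()
    print("everything ok, no need to rerun")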