How to get the number of URLs already crawled (request_count) from Scrapy?

Date: 2016-11-18 14:51:27

Tags: python python-2.7 scrapy scrapy-spider

Scrapy displays stats like these when the code runs:

2016-11-18 06:41:38 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 656,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 2661,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 11, 18, 14, 41, 38, 759760),
 'item_scraped_count': 2,
 'log_count/DEBUG': 5,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2016, 11, 18, 14, 41, 37, 807590)}

My goal is to access request_count (or response_count) from process_response or from any method of the spider.

I want to close the spider once it has crawled N URLs in total.

1 Answer:

Answer 0 (score: 1)

If you want to close the spider based on how many requests it has completed, I recommend using CLOSESPIDER_PAGECOUNT in settings.py (https://doc.scrapy.org/en/latest/topics/extensions.html#closespider-pagecount):

settings.py:

CLOSESPIDER_PAGECOUNT = 20  # close the spider after 20 pages have been crawled
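
The same setting can also be scoped to a single spider through its custom_settings class attribute; a minimal sketch (the spider name and start URL are placeholders):

import scrapy

class LimitedSpider(scrapy.Spider):
    name = 'limited'
    start_urls = ['http://example.com']

    # Overrides the project-wide settings.py value for this spider only
    custom_settings = {'CLOSESPIDER_PAGECOUNT': 20}

    def parse(self, response):
        pass  # normal parsing logic goes here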

If you want to access the Scrapy stats from inside the spider, you can do it like this:

self.crawler.stats.get_value('my_stat_name')  # e.g. 'response_received_count' or 'downloader/request_count'
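
Building on that, here is a hedged sketch that closes the spider manually once N responses have been received, combining the stats collector with Scrapy's CloseSpider exception (MAX_PAGES, the spider name, and the start URL are illustrative):

import scrapy
from scrapy.exceptions import CloseSpider

MAX_PAGES = 20  # illustrative limit

class CountingSpider(scrapy.Spider):
    name = 'counting'
    start_urls = ['http://example.com']

    def parse(self, response):
        # 'response_received_count' is one of the stats in the dump above;
        # 'downloader/request_count' can be read the same way
        count = self.crawler.stats.get_value('response_received_count', 0)
        if count >= MAX_PAGES:
            raise CloseSpider('reached the page limit')
        # ... normal parsing / item yielding continues here

Since the question also mentions process_response: the same stats object is reachable from a downloader middleware via from_crawler. A sketch (the class name is illustrative, and the middleware still has to be registered in DOWNLOADER_MIDDLEWARES):

class RequestCountMiddleware:
    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.stats)

    def process_response(self, request, response, spider):
        # 'downloader/request_count' is maintained by the built-in
        # DownloaderStats middleware, as seen in the stats dump above
        count = self.stats.get_value('downloader/request_count', 0)
        spider.logger.info('requests so far: %s', count)
        return response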