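At the end of a Nutch crawl, is it possible to find or determine how many web pages Nutch actually crawled?
Answer 0: (score: 2)
Use the readdb command with -stats; it gives you a breakdown of the URLs in the crawldb by status.
Answer 1: (score: 1)
You can use readdb: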
bin/nutch readdb crawl/crawldb -stats
Example: bin/nutch readdb crawl/dabfolder/crawldb -stats
The output looks like this:
Statistics for CrawlDb: crawl/dabfolder/crawldb/
TOTAL urls: 563390
shortest fetch interval: 30 days, 00:00:00
avg fetch interval: 30 days, 00:43:49
longest fetch interval: 45 days, 00:00:00
earliest fetch time: Fri Jun 02 11:57:00 IST 2017
avg of fetch times: Sun Jun 04 14:46:00 IST 2017
latest fetch time: Mon Jul 17 11:54:00 IST 2017
retry 0: 560279
retry 1: 3111
min score: 0.0
avg score: 0.1028828
max score: 195.854
status 1 (db_unfetched): 524278
status 2 (db_fetched): 17615
status 3 (db_gone): 1143
status 4 (db_redir_temp): 8428
status 5 (db_redir_perm): 11800
status 7 (db_duplicate): 126
CrawlDb statistics: done
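In the output above, the "status 2 (db_fetched)" line is the count of pages that were actually fetched (17615 here), while "TOTAL urls" is every URL the crawldb knows about, including ones not fetched yet. If only that number is needed, the stats output can be filtered. The following is a minimal sketch, not part of Nutch: the function name fetched_count is hypothetical, and it assumes the count is the last whitespace-separated field on the db_fetched line, as in the sample output.

```shell
# Hypothetical helper (not a Nutch command): reads `readdb -stats` output
# on stdin and prints the number at the end of the db_fetched line.
fetched_count() {
  awk '/db_fetched/ { print $NF }'
}

# Assumed usage, with the crawldb path from the answer above:
#   bin/nutch readdb crawl/crawldb -stats | fetched_count
```

Depending on the Nutch version and logging configuration, each stats line may carry a logger prefix; matching on db_fetched and taking the last field should tolerate that, but this is worth checking against your own output.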