How to set topN via the nutch crawl script

Date: 2014-12-08 16:34:16

Tags: solr web-crawler nutch

I am trying to crawl a URL, say http://def.com/xyz/, which contains more than 2000 outgoing links. However, when I query Solr it returns fewer than 50 documents, whereas I expected around 2000. I use the following command:

./crawl urls TestCrawl http://localhost:8983/solr/ -depth 2 -topN 3000          

The console output is:

Injector: starting at 2014-12-08 21:36:15
Injector: crawlDb: TestCrawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Total number of urls rejected by filters: 0
Injector: Total number of urls after normalization: 1
Injector: Merging injected urls into crawl db.
Injector: overwrite: false
Injector: update: false
Injector: URLs merged: 1
Injector: Total new urls injected: 0
Injector: finished at 2014-12-08 21:36:18, elapsed: 00:00:02

My assumption is that Nutch is not picking up the topN value from the crawl script.
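
For reference, here is a rough sketch (assumed, not verbatim; details vary between Nutch versions) of how a Nutch 1.x bin/crawl script typically determines the fetch list size: it computes a sizeFetchlist value internally and passes that to the generate step as -topN, so a -topN argument given on the crawl script's own command line is silently ignored.

# sketch of the relevant part of a Nutch 1.x bin/crawl script (assumed, not verbatim)
numSlaves=1
# effective topN per round, derived from the number of fetcher slaves
sizeFetchlist=`expr $numSlaves \* 50000`

# generate a new fetch list, limited to $sizeFetchlist entries
"$bin/nutch" generate "$CRAWL_PATH"/crawldb "$CRAWL_PATH"/segments \
    -topN $sizeFetchlist -numFetchers $numSlaves -noFilter

If a different topN is really needed, one possible workaround is to edit the sizeFetchlist value in a local copy of the script.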

1 Answer:

Answer 0 (score: 0)

Please check the property db.max.outlinks.per.page in your Nutch configuration. Change this value to a higher number, or to -1, so that all outgoing URLs are crawled and indexed.
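
For illustration, the property can be overridden in conf/nutch-site.xml roughly like this (a minimal sketch; the default limit is 100 outlinks per page, and -1 removes the limit):

<!-- conf/nutch-site.xml (sketch): lift the per-page outlink limit -->
<property>
  <name>db.max.outlinks.per.page</name>
  <!-- -1 means keep all outlinks found on a page -->
  <value>-1</value>
</property>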

Hope this helps,

Le Quoc Do