Question

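I am using Nutch 2.3.1 with MongoDB as the storage backend, running the command below. While a crawl was in progress, I pressed CTRL+C to kill the process. Since then, when I run the same crawl script again, it does not resume; it simply exits without any error. It exits in the second iteration.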

Command used: runtime/local/bin/crawl urls/ 'crawlDb' 10

Output:

ParserJob: finished at 2018-03-02 19:48:31, time elapsed: 00:00:02
CrawlDB update for crawlDb
/Users/rajeevprasanna/Desktop/nutch-cassandra/apache-nutch-2.3.1/runtime/local/bin/nutch updatedb -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true 1520000291-27137 -crawlId crawlDb
DbUpdaterJob: starting at 2018-03-02 19:48:31
DbUpdaterJob: batchId: 1520000291-27137
DbUpdaterJob: finished at 2018-03-02 19:48:34, time elapsed: 00:00:02
Skipping indexing tasks: no SOLR url provided.
Fri Mar 2 19:48:34 IST 2018 : Iteration 2 of 10
Generating batchId
Generating a new fetchlist
/Users/rajeevprasanna/Desktop/nutch-cassandra/apache-nutch-2.3.1/runtime/local/bin/nutch generate -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -topN 50000 -noNorm -noFilter -adddays 0 -crawlId crawlDb -batchId 1520000314-30627
GeneratorJob: starting at 2018-03-02 19:48:34
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: false
GeneratorJob: normalizing: false
GeneratorJob: topN: 50000
GeneratorJob: finished at 2018-03-02 19:48:37, time elapsed: 00:00:02
GeneratorJob: generated batch id: 1520000314-30627 containing 0 URLs
Generate returned 1 (no new segments created)
Escaping loop: no more URLs to fetch now
Rajeevs-MacBook-Pro:apache-nutch-2.3.1 rajeevprasanna$

Answer 1

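The reason is shown in the output: "no more URLs to fetch now". There are no new unfetched links in the web table. To start from scratch, you need to delete the CrawlDb (the web table) in MongoDB.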
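As a rough sketch of the reset described above, the script below builds the mongo shell call that would drop the web table. Two things here are assumptions and not taken from the question: the database name `nutch` (use whatever `gora.mongodb.db` is set to in your conf/gora.properties) and the collection name `<crawlId>_webpage`, which is how the Gora MongoDB store is commonly reported to name it. It only prints the command unless `CONFIRM=yes` is set, because the drop is irreversible.

```shell
# Sketch: reset a Nutch 2.x crawl stored in MongoDB by dropping its web table.
set -eu

CRAWL_ID="crawlDb"                 # crawl id used in the question
MONGO_DB="${MONGO_DB:-nutch}"      # ASSUMPTION: value of gora.mongodb.db
COLLECTION="${CRAWL_ID}_webpage"   # ASSUMPTION: collection is <crawlId>_webpage

# JavaScript handed to the mongo shell; drop() deletes the whole collection.
DROP_JS="db.getCollection('${COLLECTION}').drop()"

if [ "${CONFIRM:-no}" = "yes" ]; then
  mongo "$MONGO_DB" --eval "$DROP_JS"
else
  # Dry run by default: only show what would be executed.
  echo "would run: mongo $MONGO_DB --eval \"$DROP_JS\""
fi
```

Before dropping anything, it is worth listing the collections in that database to confirm the actual name. After the drop, running the crawl script again injects the seed URLs into a fresh web table.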

Answer 2

These commands have a -resume parameter, which should work here.
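A minimal sketch of what that could look like, assuming the commands meant are `fetch` and `parse` and that both accept `-all`, `-crawlId` and `-resume` in your build (running `bin/nutch fetch` with no arguments prints the usage, so this is easy to check). The path and crawl id are the ones from the question; the script only prints the commands if the `nutch` launcher is not found at that path.

```shell
# Sketch: re-run the interrupted fetch and parse steps with -resume.
set -eu

NUTCH="${NUTCH:-runtime/local/bin/nutch}"   # path from the question, relative to the Nutch home
CRAWL_ID="crawlDb"                          # crawl id used in the question

# ASSUMPTION: fetch and parse accept -resume; -all selects every pending batch.
FETCH_CMD="$NUTCH fetch -all -crawlId $CRAWL_ID -resume"
PARSE_CMD="$NUTCH parse -all -crawlId $CRAWL_ID -resume"

if [ -x "$NUTCH" ]; then
  # Left unquoted on purpose so each string splits into separate arguments
  # (this breaks if the path contains spaces).
  $FETCH_CMD
  $PARSE_CMD
else
  echo "would run: $FETCH_CMD"
  echo "would run: $PARSE_CMD"
fi
```

Once both steps finish, the regular crawl script can be started again with the same crawl id.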

Nutch 2 does not resume the crawl
