What exactly is a "segment" in Nutch terminology?

Time: 2013-03-12 20:21:23

Tags: apache web-crawler nutch

I have just started using Nutch 1.6. My initial crawl ran until I hit the following problem:

  

    LinkDb: adding segment: file:/var/apache-nutch/crawl/segments/2013031234747
    LinkDb: adding segment: file:/var/apache-nutch/crawl/segments/2013031250939
    Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/var/apache-nutch/crawl/segments/20130308114306/parse_data
    Input path does not exist: file:/var/apache-nutch/crawl/segments/20130312135244/parse_data
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
        at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
        at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:989)
        at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:981)
        at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1261)
        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:180)
        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:151)
        at org.apache.nutch.crawl.Crawl.run(Crawl.java:143)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

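I would like to understand what exactly a "segment" is in Nutch. At the start of the error above it says "LinkDb: adding segment...". What is it trying to do? What is being segmented?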

1 answer:

Answer 0 (score: 2)

A segment is a partition (a Hadoop input partition) created by the MapReduce jobs that Nutch runs when it starts crawling from the set of seed URLs given to the crawler.
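To make this more concrete: on disk, a Nutch 1.x segment is a timestamp-named directory under crawl/segments, and a fully processed one normally holds the subdirectories crawl_generate, crawl_fetch, content, parse_text, parse_data and crawl_parse. The trace in the question shows LinkDb failing because parse_data is missing under two of those directories. Below is a minimal Python sketch for spotting such segments before running the link inversion. It is not part of Nutch; the function name `missing_parts` and the constant `EXPECTED_PARTS` are made up for this illustration.

```python
from pathlib import Path

# Subdirectories a fully processed Nutch 1.x segment normally contains.
EXPECTED_PARTS = (
    "crawl_generate",
    "crawl_fetch",
    "content",
    "parse_text",
    "parse_data",
    "crawl_parse",
)


def missing_parts(segments_dir):
    """Return {segment name: [missing subdirectories]} for incomplete segments.

    Hypothetical helper for illustration only; it just inspects the
    directory layout and does not read any Nutch data.
    """
    report = {}
    for segment in sorted(Path(segments_dir).iterdir()):
        if not segment.is_dir():
            continue  # ignore stray files next to the segment directories
        missing = [part for part in EXPECTED_PARTS
                   if not (segment / part).is_dir()]
        if missing:
            report[segment.name] = missing
    return report
```

Calling `missing_parts("/var/apache-nutch/crawl/segments")` (the path from the log above) would list every segment that lacks one of the expected parts. A segment reported without parse_data was probably generated or fetched but never parsed, which is one plausible reason for the "Input path does not exist" error.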