Nutch indexer FileNotFoundException: data does not exist

Date: 2017-10-27 04:11:10

Tags: indexing solr nutch

I am running Nutch to crawl and index into Solr. When I run bin/nutch, I get the following error:

Indexer: java.io.FileNotFoundException: File file:/opt/nutch/crawl/linkdb/current/linkdb-merge-1124746471/data does not exist.
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:402)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:255)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:47)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
at java.security.AccessController.doPrivileged(AccessController.java:488)
at javax.security.auth.Subject.doAs(Subject.java:572)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)

So it complains that /opt/nutch/crawl/linkdb/current/linkdb-merge-1124746471/data does not exist. However, /opt/nutch/crawl/linkdb/current/linkdb-merge-1124746471/part00000/data does exist. How can this discrepancy arise? That is, where can I configure the indexing process so that the indexer finds the files created by the previous step?

Any help or hints would be greatly appreciated!

1 answer:

Answer 0 (score: 0)

The folder .../linkdb/current/linkdb-merge-1124746471/ needs to be deleted. It is a temporary folder left behind by an "invertlinks" or "mergelinkdb" job, and it is in the wrong place: it should be at .../linkdb/linkdb-merge-1124746471/. This can happen if the job is called with .../linkdb/current/ instead of .../linkdb/, because there are no restrictions on how a linkdb may be named. The indexer then treats the stray merge folder as part of the linkdb and fails when it does not find a data file directly inside it.
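The cleanup described above can be sketched as follows. This is a minimal demonstration run against a throwaway temp directory that mimics the layout from the question (the real path would be /opt/nutch/crawl); the directory names are taken from the error message, and everything else is an assumption.

```shell
#!/bin/sh
# Build a throwaway copy of the layout from the question:
# a stray linkdb-merge-* temp folder sitting under linkdb/current/.
CRAWL_DIR=$(mktemp -d)
mkdir -p "$CRAWL_DIR/linkdb/current/linkdb-merge-1124746471/part00000"

# The indexer expects only part-style subfolders under current/,
# so remove any leftover merge folders that ended up there.
rm -rf "$CRAWL_DIR"/linkdb/current/linkdb-merge-*

# current/ itself must survive; only the stray temp folder goes away.
ls "$CRAWL_DIR/linkdb/current"
```

On a real installation the equivalent step would be `rm -rf /opt/nutch/crawl/linkdb/current/linkdb-merge-1124746471`, and subsequent invertlinks/indexing jobs should be pointed at the linkdb root (`/opt/nutch/crawl/linkdb`), not at `/opt/nutch/crawl/linkdb/current/`.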