生成一组要在Nutch中获取的新URL的问题。
这是我正在使用的命令:
# $NUTCH_HOME/runtime/local/bin/nutch generate -topN 10
结果:
Generator: starting at 2016-12-14 19:16:50
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: running in local mode, generating exactly one partition.
Generator: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/root/bob/-topN/current
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:329)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:320)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:575)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:570)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:570)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:561)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:862)
at org.apache.nutch.crawl.Generator.generate(Generator.java:589)
at org.apache.nutch.crawl.Generator.run(Generator.java:764)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.Generator.main(Generator.java:717)
我错过了什么?
答案 0 :(得分:0)
您的命令中缺少参数,请查看https://wiki.apache.org/nutch/bin/nutch%20generate基本上您需要提供crawldb和segment子目录的路径。