I'm running into a problem with Nutch. Here is the command I am running:
bin/nutch inject bin/crawl/crawldb bin/urls
After running the above command, I get the following error:
Injector: starting at 2014-04-02 13:02:29
Injector: crawlDb: bin/crawl/crawldb
Injector: urlDir: bin/urls/seed.txt
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 2
Injector: total number of urls injected after normalization and filtering: 0
Injector: Merging injected urls into crawl db.
Injector: overwrite: false
Injector: update: false
Injector: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
at org.apache.nutch.crawl.Injector.inject(Injector.java:294)
at org.apache.nutch.crawl.Injector.run(Injector.java:316)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Injector.main(Injector.java:306)
This is the first time I have run Nutch. I checked Solr, and Nutch is installed properly.
The following details are from the log file:
java.io.IOException: The temporary job-output directory file:/usr/share/apache-nutch-1.8/bin/crawl/crawldb/1639805438/_temporary doesn't exist!
at org.apache.hadoop.mapred.FileOutputCommitter.getWorkPath(FileOutputCommitter.java:250)
at org.apache.hadoop.mapred.FileOutputFormat.getTaskOutputPath(FileOutputFormat.java:244)
at org.apache.hadoop.mapred.MapFileOutputFormat.getRecordWriter(MapFileOutputFormat.java:46)
at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.<init>(ReduceTask.java:449)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:491)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:421)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
2014-04-02 12:54:46,251 ERROR crawl.Injector - Injector: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
at org.apache.nutch.crawl.Injector.inject(Injector.java:294)
at org.apache.nutch.crawl.Injector.run(Injector.java:316)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Injector.main(Injector.java:306)
Answer 0 (score: 0)
Injecting with the command bin/nutch inject bin/crawl/crawldb bin/urls
instead of bin/nutch inject crawl/crawldb bin/urls
solved this error.
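For reference, a minimal sketch of how the directories might be laid out before injecting, assuming Nutch 1.8 running in local mode under /usr/share/apache-nutch-1.8 (the path seen in the log above) and a seed list in bin/urls; the example.com URL is only a placeholder:

cd /usr/share/apache-nutch-1.8                    # NUTCH_HOME, taken from the log path above
mkdir -p bin/urls                                 # directory holding the seed list
echo "http://example.com/" > bin/urls/seed.txt    # placeholder seed URL, replace with your own
bin/nutch inject bin/crawl/crawldb bin/urls       # the crawldb is created if it does not exist yet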
Also, to get the URLs fetched I made changes to the regex-urlfilter.txt file, and I can now fetch the URLs.
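For illustration, here is a hedged sketch of the kind of regex-urlfilter.txt change that stops the injector from rejecting every seed (the log above shows 2 URLs rejected by filters and 0 injected), assuming the crawl should be limited to one site; example.com is a placeholder domain, not taken from the original post. In conf/regex-urlfilter.txt the rules are tried top to bottom: lines starting with + accept a URL, lines starting with - reject it.

# conf/regex-urlfilter.txt (excerpt)
# accept the placeholder domain example.com and its subdomains
+^https?://([a-z0-9-]+\.)*example\.com/
# reject anything that did not match an earlier rule
-.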
Answer 1 (score: 0)
Make sure there are no syntax errors in any of your configuration files.
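For the XML configuration files (for example nutch-site.xml and nutch-default.xml), one quick way to catch syntax errors, assuming xmllint from libxml2 is available, is:

xmllint --noout conf/nutch-site.xml      # prints nothing if the XML is well-formed
xmllint --noout conf/nutch-default.xml   # any parse error is reported with its line number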