I am using Nutch 2.2.1, HBase 0.90.4 and Solr 4.3.0.
I get the following error.
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 1
Exception in thread "main" java.lang.RuntimeException: job failed: name=generate: null, jobid=job_local1662982347_0002
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:199)
at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68)
at org.apache.nutch.crawl.Crawler.run(Crawler.java:152)
at org.apache.nutch.crawl.Crawler.run(Crawler.java:250)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawler.main(Crawler.java:257)
The Hadoop log contains the following.
2014-08-11 09:13:43,246 INFO crawl.InjectorJob - InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
2014-08-11 09:13:43,293 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2014-08-11 09:13:43,372 WARN snappy.LoadSnappy - Snappy native library not loaded
2014-08-11 09:13:44,017 INFO mapreduce.GoraRecordWriter - gora.buffer.write.limit = 10000
2014-08-11 09:13:44,245 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2014-08-11 09:13:44,381 WARN mapred.FileOutputCommitter - Output path is null in cleanup
2014-08-11 09:13:44,686 INFO crawl.InjectorJob - InjectorJob: total number of urls rejected by filters: 0
2014-08-11 09:13:44,686 INFO crawl.InjectorJob - InjectorJob: total number of urls injected after normalization and filtering: 1
2014-08-11 09:13:44,695 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2014-08-11 09:13:44,696 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
2014-08-11 09:13:44,696 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2014-08-11 09:13:45,392 INFO mapreduce.GoraRecordReader - gora.buffer.read.limit = 10000
2014-08-11 09:13:45,501 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2014-08-11 09:13:45,501 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
2014-08-11 09:13:45,501 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2014-08-11 09:13:45,547 INFO regex.RegexURLNormalizer - can't find rules for scope 'generate_host_count', using default
2014-08-11 09:13:45,654 INFO mapreduce.GoraRecordWriter - gora.buffer.write.limit = 10000
2014-08-11 09:13:45,670 WARN mapred.FileOutputCommitter - Output path is null in cleanup
2014-08-11 09:13:45,671 WARN mapred.LocalJobRunner - job_local1662982347_0002
java.lang.NullPointerException
at org.apache.avro.util.Utf8.<init>(Utf8.java:37)
at org.apache.nutch.crawl.GeneratorReducer.setup(GeneratorReducer.java:100)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
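Unfortunately, I don't know what I am doing wrong.
I followed every step described in the book "Web Crawling and Data Mining with Apache Nutch".
The error still comes back, and at this point I am out of ideas.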
Answer 0 (score: 0)
This is the line that fails:
batchId = new Utf8(conf.get(GeneratorJob.BATCH_ID));
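How are you running it? If I'm not mistaken, the crawl command is deprecated, and generate now requires a batch ID; at least, that is what happened to me before. With the current development branch it seems to work fine even if you don't set a batch ID...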
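To illustrate the failure mode, here is a minimal, self-contained sketch. It uses assumptions rather than Nutch's actual code: a plain Map stands in for the Hadoop Configuration, String.getBytes stands in for the Utf8 constructor, and the key name example.batch.id and the class BatchIdGuard are hypothetical. It only shows that looking up an unset key yields null, that encoding the null throws a NullPointerException, and that an explicit check produces a clearer error.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

class BatchIdGuard {
    // Hypothetical key for this sketch; Nutch reads the key named by GeneratorJob.BATCH_ID.
    static final String BATCH_ID_KEY = "example.batch.id";

    // Stand-in for wrapping the string in a Utf8: throws NullPointerException on null.
    static byte[] encode(String value) {
        return value.getBytes(StandardCharsets.UTF_8);
    }

    // Looks up the batch ID and fails with a descriptive message if it was never set.
    static String requireBatchId(Map<String, String> conf) {
        String batchId = conf.get(BATCH_ID_KEY);
        if (batchId == null) {
            throw new IllegalStateException("batch ID is not set under key " + BATCH_ID_KEY);
        }
        return batchId;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();

        // Unset key: the lookup returns null and encoding it fails.
        try {
            encode(conf.get(BATCH_ID_KEY));
        } catch (NullPointerException e) {
            System.out.println("unset batch ID caused a NullPointerException");
        }

        // Unset key with the guard: the error names the missing setting.
        try {
            requireBatchId(conf);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }

        // Once the key is set, the same lookup and encoding succeed.
        conf.put(BATCH_ID_KEY, "sample-batch");
        byte[] bytes = encode(requireBatchId(conf));
        System.out.println(new String(bytes, StandardCharsets.UTF_8));
    }
}
```

If the real job behaves the same way, the NullPointerException in GeneratorReducer.setup would simply mean that no batch ID reached the reducer's configuration, which fits the deprecated crawl command not setting one.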
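From http://wiki.apache.org/nutch/Nutch2Tutorial:
N.B. The crawl command in the bin/nutch script is deprecated. You should either use the individual commands or alternatively use the bin/crawl script ... which effectively chains the individual commands together.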