我在Hadoop集群上运行Nutch,并且已爬网的数据存储在Cassandra集群中。运行Nutch作业时,出现以下错误:
java.lang.NullPointerException
at org.apache.avro.util.Utf8.<init>(Utf8.java:38)
at org.apache.nutch.crawl.GeneratorReducer.setup(GeneratorReducer.java:100)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
我这样开始Nutch的工作:
$HADOOP_HOME/bin/hadoop jar /nutch/apache-nutch-2.2.1.job org.apache.nutch.crawl.Crawl urls -dir crawl -depth 3 -topN 5