Nutch on Hadoop on Google Cloud Dataproc

Date: 2016-09-11 18:06:02

Tags: hadoop nutch gcloud google-cloud-dataproc

When I try to run Nutch on Hadoop on Google Cloud (Dataproc), I get the following error, and I would like to know why I am facing this problem:

user@cluster-1-m:~/apache-nutch-1.7/build$ hadoop jar /home/user/apache-nutch-1.7/runtime/deploy/apache-nutch-1.7.job org.apache.nutch.crawl.Crawl /tmp/testnutch/input/urls.txt -solr http://SOLRIP:8080/solr/ -depth 5 -topN2

16/09/11 17:57:38 INFO crawl.Crawl: crawl started in: crawl-20160911175737
16/09/11 17:57:38 INFO crawl.Crawl: rootUrlDir = -topN2
16/09/11 17:57:38 INFO crawl.Crawl: threads = 10
16/09/11 17:57:38 INFO crawl.Crawl: depth = 5
16/09/11 17:57:38 INFO crawl.Crawl: solrUrl = http://SOLRIP:8080/solr/
16/09/11 17:57:38 WARN conf.Configuration: Could not make crawl/20160911175738 in local directories from mapreduce.cluster.local.dir
16/09/11 17:57:38 WARN conf.Configuration: mapreduce.cluster.local.dir[0]=/hadoop/mapred/local
Exception in thread "main" java.io.IOException: No valid local directories in property: mapreduce.cluster.local.dir
        at org.apache.hadoop.conf.Configuration.getLocalPath(Configuration.java:2302)
        at org.apache.hadoop.mapred.JobConf.getLocalPath(JobConf.java:569)
        at org.apache.nutch.crawl.Crawl.run(Crawl.java:123)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:136)

1 answer:

Answer 0: (score: 0)

You are getting this exception because you are running the job as your own user, which by default cannot write to the local Hadoop directories owned by the `mapred` user, so the driver cannot access them. Try the following:

sudo -u mapred hadoop jar \
    /home/user/apache-nutch-1.7/runtime/deploy/apache-nutch-1.7.job \
    org.apache.nutch.crawl.Crawl /tmp/testnutch/input/urls.txt \
    -solr http://SOLRIP:8080/solr/ -depth 5 -topN2
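To see why the exception fires, the failing check can be illustrated with a small shell sketch. The helper name and the missing-directory path below are hypothetical; the idea is that Hadoop walks the comma-separated entries of mapreduce.cluster.local.dir and throws "No valid local directories" only if none of them is a writable directory for the current user:

```shell
#!/usr/bin/env bash
# Hypothetical helper mimicking Hadoop's local-dir validation:
# print OK for each entry the current user can write to, BAD otherwise.
check_local_dirs() {
  IFS=',' read -ra dirs <<< "$1"
  for d in "${dirs[@]}"; do
    if [ -d "$d" ] && [ -w "$d" ]; then
      echo "OK: $d"
    else
      echo "BAD: $d"
    fi
  done
}

good=$(mktemp -d)   # stand-in for a directory owned by the submitting user
# On the cluster the list would be /hadoop/mapred/local, as in the warning above.
check_local_dirs "$good,/definitely/missing/dir"
rm -rf "$good"
```

If every entry prints BAD for the user running `hadoop jar` (here, `user` instead of `mapred`), the job driver fails exactly as in the stack trace, which is why switching to the `mapred` user helps.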

Alternatively, if you want to submit through the Dataproc jobs API without SSHing into the cluster, Dataproc will also run the job with sufficient permissions:

gcloud dataproc jobs submit hadoop --cluster cluster-1 \
    --jar apache-nutch-1.7.jar \
    -- org.apache.nutch.crawl.Crawl /tmp/testnutch/input/urls.txt \
    -solr http://SOLRIP:8080/solr/ -depth 5 -topN2