使用Nutch进行爬网时出错 - 输入路径不存在:hdfs://.../urls/seed.txt

时间:2014-07-16 19:51:49

标签: hadoop nutch emr web-crawler

我安装了Apache Nutch并使用以下命令运行抓取:

bin/crawl ./urls/seed.txt crawl http://localhost:8983/solr/ 5

从运行时/本地工作正常。当我从runtime / deploy运行相同的命令时,我得到:

14/07/16 19:43:35 INFO crawl.InjectorJob: InjectorJob: starting at 2014-07-16 19:43:35
14/07/16 19:43:35 INFO crawl.InjectorJob: InjectorJob: Injecting urlDir: urls/seed.txt
14/07/16 19:43:37 INFO connection.CassandraHostRetryService: Downed Host Retry service started with queue size -1 and retry delay 10s
14/07/16 19:43:37 INFO service.JmxMonitor: Registering JMX me.prettyprint.cassandra.service_Test Cluster:ServiceType=hector,MonitorType=hector
14/07/16 19:43:37 INFO crawl.InjectorJob: InjectorJob: Using class org.apache.gora.cassandra.store.CassandraStore as the Gora storage class.
14/07/16 19:43:37 INFO mapred.JobClient: Default number of map tasks: null
14/07/16 19:43:37 INFO mapred.JobClient: Setting default number of map tasks based on cluster size to : 12
14/07/16 19:43:37 INFO mapred.JobClient: Default number of reduce tasks: 0
14/07/16 19:43:38 INFO security.ShellBasedUnixGroupsMapping: add hadoop to shell userGroupsCache
14/07/16 19:43:38 INFO mapred.JobClient: Setting group to hadoop
14/07/16 19:43:39 INFO mapred.JobClient: Cleaning up the staging area hdfs://172.31.13.61:9000/mnt/var/lib/hadoop/tmp/mapred/staging/hadoop/.staging/job_201407161337_0024
14/07/16 19:43:39 ERROR security.UserGroupInformation: PriviledgedActionException as:hadoop cause:org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://172.31.13.61:9000/user/hadoop/urls/seed.txt
14/07/16 19:43:39 ERROR crawl.InjectorJob: InjectorJob: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://172.31.13.61:9000/user/hadoop/urls/seed.txt
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:235)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
    at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:1016)
    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1033)
    at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:174)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:951)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:904)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1140)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:904)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:501)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:531)
    at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:50)
    at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:233)
    at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:251)
    at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:273)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:282)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:187)

它没有找到文件seed.txt,是的,它确实存在于$ HOME / urls / seed.txt中。我正在使用AWS EMR和Cassandra。任何帮助将不胜感激。

0 个答案:

没有答案