I have successfully run a Nutch (v1.4) crawl on my Ubuntu 11.10 system in local mode. However, when switching to "deploy" mode (all else being the same), I get an error during the fetch cycle.
I have Hadoop running successfully on the machine in pseudo-distributed mode (replication factor of 1, with just one map and one reduce slot configured). "jps" shows all the Hadoop daemons up and running:

18920 Jps
14799 DataNode
15127 JobTracker
14554 NameNode
15361 TaskTracker
15044 SecondaryNameNode
I have also added the HADOOP_HOME/bin path to my PATH variable:
PATH=$PATH:/home/jimb/hadoop/bin
I then ran the crawl from the nutch/deploy directory, as follows:
bin/nutch crawl /data/runs/ar/seedurls -dir /data/runs/ar/crawls
Here is the output I got:
12/01/25 13:55:49 INFO crawl.Crawl: crawl started in: /data/runs/ar/crawls
12/01/25 13:55:49 INFO crawl.Crawl: rootUrlDir = /data/runs/ar/seedurls
12/01/25 13:55:49 INFO crawl.Crawl: threads = 10
12/01/25 13:55:49 INFO crawl.Crawl: depth = 5
12/01/25 13:55:49 INFO crawl.Crawl: solrUrl=null
12/01/25 13:55:49 INFO crawl.Injector: Injector: starting at 2012-01-25 13:55:49
12/01/25 13:55:49 INFO crawl.Injector: Injector: crawlDb: /data/runs/ar/crawls/crawldb
12/01/25 13:55:49 INFO crawl.Injector: Injector: urlDir: /data/runs/ar/seedurls
12/01/25 13:55:49 INFO crawl.Injector: Injector: Converting injected urls to crawl db entries.
12/01/25 13:56:53 INFO mapred.FileInputFormat: Total input paths to process : 1
...
...
12/01/25 13:57:21 INFO crawl.Injector: Injector: Merging injected urls into crawl db.
...
12/01/25 13:57:48 INFO crawl.Injector: Injector: finished at 2012-01-25 13:57:48, elapsed: 00:01:59
12/01/25 13:57:48 INFO crawl.Generator: Generator: starting at 2012-01-25 13:57:48
12/01/25 13:57:48 INFO crawl.Generator: Generator: Selecting best-scoring urls due for fetch.
12/01/25 13:57:48 INFO crawl.Generator: Generator: filtering: true
12/01/25 13:57:48 INFO crawl.Generator: Generator: normalizing: true
12/01/25 13:57:48 INFO mapred.FileInputFormat: Total input paths to process : 2
...
12/01/25 13:58:15 INFO crawl.Generator: Generator: Partitioning selected urls for politeness.
12/01/25 13:58:16 INFO crawl.Generator: Generator: segment: /data/runs/ar/crawls/segments/20120125135816
...
12/01/25 13:58:42 INFO crawl.Generator: Generator: finished at 2012-01-25 13:58:42, elapsed: 00:00:54
12/01/25 13:58:42 ERROR fetcher.Fetcher: Fetcher: No agents listed in 'http.agent.name' property.
Exception in thread "main" java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property.
at org.apache.nutch.fetcher.Fetcher.checkConfiguration(Fetcher.java:1261)
at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1166)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:136)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Now, the configuration files for "local" mode are set up fine (since a crawl in local mode succeeds). For running in deploy mode, since the "deploy" folder does not have any "conf" subdirectory, I assumed that either: (a) the conf files need to be copied under "deploy/conf", OR (b) the conf files need to be placed on HDFS.
I have verified that option (a) above does not work. So, I'm assuming that the Nutch configuration files need to exist in HDFS for the fetcher to run successfully? However, I don't know at what path in HDFS I should place these Nutch conf files, or perhaps I'm barking up the wrong tree?
If Nutch reads its configuration from the files under "local/conf" even in "deploy" mode, then why does the local crawl work fine while the deploy-mode crawl does not?
What am I missing here?
Thanks in advance!
Answer 0 (score: 2)
Try this:

1. In the nutch source directory, modify the file conf/nutch-site.xml to set the http.agent.name property correctly.
2. Rebuild with ant.
3. Go to the runtime/deploy directory, set the required environment variables, and try the crawl again.
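The key point is that in deploy mode the configuration is read from the .job archive that ant builds, not from a conf directory on disk, so the property must be set before rebuilding. A minimal sketch of the entry to add to conf/nutch-site.xml (the agent string "MyCrawler" is just a placeholder; use your own):

```xml
<configuration>
  <!-- Must be set before running "ant", so the value gets packaged
       into the .job file that deploy mode submits to Hadoop. -->
  <property>
    <name>http.agent.name</name>
    <value>MyCrawler</value>
  </property>
</configuration>
```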
Answer 1 (score: 1)
This is probably because you have not rebuilt. Can you run "ant" and see what happens? Obviously, you also need to update http.agent.name in nutch-site.xml if you haven't done so already.
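For reference, the edit-rebuild-retry sequence both answers describe might look like the following; the source checkout path is an assumption based on a default Nutch 1.4 layout:

```
cd ~/apache-nutch-1.4          # assumed location of the Nutch source checkout
vi conf/nutch-site.xml         # set http.agent.name here
ant                            # rebuilds runtime/local and runtime/deploy,
                               # repackaging conf/ into the .job file
cd runtime/deploy
bin/nutch crawl /data/runs/ar/seedurls -dir /data/runs/ar/crawls
```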