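Has anyone managed to run Nutch 2.3.1 on a Hadoop 2 cluster? I have been trying to get Nutch 2.3.1 running on my Hadoop/YARN 2.7.1 cluster for about two days now.
First of all, Nutch is installed only locally, not on every node. I set HBase as the storage engine.
Initially, downloading it and trying it on the cluster failed because it could not find some libraries on the worker side. I worked around that by modifying the runtime/local/bin/nutch script so that all the libraries are included when the jar is sent for execution: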
LIBJARS="$NUTCH_HOME"/lib/apache-nutch-2.3.1.jar
for f in "$NUTCH_HOME"/lib/*.jar; do
LIBJARS="${LIBJARS},$f";
done
# run it
exec "${EXEC_CALL[@]}" $CLASS -libjars $LIBJARS "$@"
However, after fixing that, I ran into the following error, which I don't know how to solve:
InjectorJob: starting at 2016-05-11 10:37:46
InjectorJob: Injecting urlDir: /user/ubuntu/urls
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
Error: org.apache.gora.util.GoraException: java.lang.RuntimeException: java.net.MalformedURLException
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:167)
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:118)
at org.apache.gora.mapreduce.GoraOutputFormat.getRecordWriter(GoraOutputFormat.java:88)
at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.<init>(MapTask.java:647)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:767)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.RuntimeException: java.net.MalformedURLException
at org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:132)
at org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:102)
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:161)
... 10 more
Caused by: java.net.MalformedURLException
at java.net.URL.<init>(URL.java:630)
at java.net.URL.<init>(URL.java:493)
at java.net.URL.<init>(URL.java:442)
at org.apache.xerces.impl.XMLEntityManager.setupCurrentEntity(Unknown Source)
at org.apache.xerces.impl.XMLVersionDetector.determineDocVersion(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
at org.jdom.input.SAXBuilder.build(SAXBuilder.java:518)
at org.jdom.input.SAXBuilder.build(SAXBuilder.java:865)
at org.apache.gora.hbase.store.HBaseStore.readMapping(HBaseStore.java:719)
at org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:116)
... 12 more
Caused by: java.lang.NullPointerException
at java.net.URL.<init>(URL.java:535)
... 25 more
InjectorJob: java.lang.RuntimeException: job failed: name=apache-nutch-2.3.1.jar, jobid=job_1462952885071_0009
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:119)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:231)
at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:252)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:275)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:284)
Answer 0 (score: 0)
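Got it. First, I was running the scripts in runtime/local/bin, which is not valid for a cluster. The right scripts to run in this case are the ones in runtime/deploy/bin. According to the Nutch wiki: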
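The Nutch job jar you find in $NUTCH_HOME/runtime/deploy is self-contained and ships with all the configuration files necessary for Nutch to be able to run on any vanilla Hadoop cluster. All you need is a healthy cluster and a Hadoop environment (cluster or local) that points to the jobtracker.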
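In practice this means launching from runtime/deploy instead of runtime/local. Here is a minimal sketch, assuming a source build and the same /user/ubuntu/urls seed directory as in the question. The helper name `find_nutch_job` is hypothetical, and the `apache-nutch-*.job` file name pattern is an assumption about what the build produces; the helper only confirms that a job file is present before you submit anything.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Nutch): print the path of the Nutch job
# file found in a deploy directory, or return non-zero if there is none.
# Assumes the job file is named apache-nutch-<version>.job.
find_nutch_job() {
  local deploy_dir="$1"
  local f
  for f in "$deploy_dir"/apache-nutch-*.job; do
    # An unmatched glob stays literal, so check that the file really exists.
    if [ -f "$f" ]; then
      printf '%s\n' "$f"
      return 0
    fi
  done
  return 1
}

# Usage (assumes NUTCH_HOME points at the source root and `hadoop` is on PATH):
#   find_nutch_job "$NUTCH_HOME/runtime/deploy" &&
#     (cd "$NUTCH_HOME/runtime/deploy" && bin/nutch inject /user/ubuntu/urls)
```

If the helper finds nothing, the deploy runtime has probably not been built yet, and running the local scripts against the cluster would lead back to the missing-library problem from the question.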
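Also, and this is very important: to build Nutch correctly for distributed mode, the nutch-site.xml configuration should NOT contain a setting for plugin.folders. Mine contains: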
<property>
<name>http.agent.name</name>
<value>Sofia's Nutch Spider</value>
</property>
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.hbase.store.HBaseStore</value>
</property>
<!-- If this is not set to -1, then big pages might not be scanned till the end -->
<property>
<name>http.content.limit</name>
<value>-1</value>
</property>
<property>
<name>file.crawl.parent</name>
<value>false</value>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://hadoop-master:8020</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop-master</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>hadoop-master:8032</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>hadoop-master:8031</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>hadoop-master:8030</value>
</property>
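A leftover plugin.folders override is easy to miss, so a quick check before rebuilding can help. This is a minimal sketch: the function name `has_plugin_folders` is hypothetical, and the match is a plain-text grep rather than an XML parse, so a commented-out property would also be reported.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Nutch): succeed if the given config file
# mentions a <name>plugin.folders</name> element, fail otherwise.
# This is a text match only; it does not parse XML or skip comments.
has_plugin_folders() {
  grep -Eq '<name>[[:space:]]*plugin\.folders[[:space:]]*</name>' "$1"
}

# Usage (the conf/nutch-site.xml path is an assumption about your layout):
#   if has_plugin_folders conf/nutch-site.xml; then
#     echo "remove plugin.folders before building for distributed mode" >&2
#   fi
```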