我试图在emr上使用oozie执行spark提交时收到以下错误

时间:2017-02-08 14:37:54

标签: apache-spark oozie emr spark-submit

我正在群集模式下运行。 apacheds-kerberos-codec-2.0.0-M15.jar存在于oozie / share / lib / lib * / spark和oozie / share / lib / lib * / oozie中的多个位置。这是一个环境问题吗?

ava.lang.IllegalArgumentException: Attempt to add (hdfs://ip-172-20-10-53.ec2.internal:8020/user/oozie/share/lib/lib_20170208121307/oozie/apacheds-kerberos-codec-2.0.0-M15.jar) multiple times to the distributed cache.
    at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11$$anonfun$apply$8.apply(Client.scala:608)
    at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11$$anonfun$apply$8.apply(Client.scala:599)
    at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:74)
    at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11.apply(Client.scala:599)
    at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11.apply(Client.scala:598)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:598)
    at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:868)
    at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:170)
    at org.apache.spark.deploy.yarn.Client.run(Client.scala:1154)
    at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1213)
    at org.apache.spark.deploy.yarn.Client.main(Client.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
    at org.apache.oozie.action.hadoop.SparkMain.runSpark(SparkMain.java:338)
    at org.apache.oozie.action.hadoop.SparkMain.run(SparkMain.java:257)
    at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:60)
    at org.apache.oozie.action.hadoop.SparkMain.main(SparkMain.java:78)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:232)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:455)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:344)
    at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler.runSubtask(LocalContainerLauncher.java:380)
    at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler.runTask(LocalContainerLauncher.java:301)
    at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler.access$200(LocalContainerLauncher.java:187)
    at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler$1.run(LocalContainerLauncher.java:230)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

2 个答案:

答案 0 :(得分:0)

oozie sharelib和spark sharelib目录似乎共享相同的jar,运行spark工作流会导入两个目录,而hadoop-3代码库并不喜欢。

我必须将oozie sharelib目录重新组织为只有oozie特定的jar,以便oozie和spark sharelib dirs之间没有重复:

export HADOOP_USER_NAME=oozie
hdfs dfs -mv /user/oozie/share/lib/lib_20170222143042/oozie /user/oozie/share/lib/lib_20170222143042/oozie.old
hdfs dfs -mkdir /user/oozie/share/lib/lib_20170222143042/oozie
hdfs dfs -cp /user/oozie/share/lib/lib_20170222143042/oozie.old/oozie-hadoop-utils-hadoop-2-4.3.0.jar /user/oozie/share/lib/lib_20170222143042/oozie
hdfs dfs -cp /user/oozie/share/lib/lib_20170222143042/oozie.old/oozie-sharelib-oozie-4.3.0.jar /user/oozie/share/lib/lib_20170222143042/oozie

这解决了能够从oozie运行spark工作流程的直接问题,但我不确定这是否会影响非火花工作流程。

答案 1 :(得分:0)

我有一份oozie工作,启动了一项火花工作,在Amazon EMR中运行。当EMR Hadoop设置有一个主服务器实例和一个服务器实例时,我得到了同样的错误。当我将奴隶的实例数量增加到两个时,一切都按预期工作。