我将AWS EMR 5.7.0与Oozie版本4.3.0和spark版本2.1.1一起使用
我用Scala编写了一个简单的Spark程序。使用spark-submit从shell执行时工作正常。
但是当我尝试使用Oozie Spark动作执行此程序时,我遇到了错误。
nameNode=hdfs://ip-xx-xx-xx-xx.ec2.internal:8020
jobTracker=ip-xx-xx-xx-xx.ec2.internal:8032
master=local
oozie.use.system.libpath=true
oozie.wf.application.path=hdfs://ip-xx-xx-xx-xx.ec2.internal:8020/test-artifacts/
oozie.action.sharelib.for.spark = spark2
<?xml version="1.0" encoding="UTF-8"?>
<workflow-app xmlns="uri:oozie:workflow:0.5" name="Test program">
<start to="spark-node" />
<action name="spark-node">
<spark xmlns="uri:oozie:spark-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<master>${master}</master>
<name>Spark on Oozie - test job</name>
<class>TestPackage.TestObj</class>
<jar>/home/hadoop/oozie-test.jar</jar>
</spark>
<ok to="end" />
<error to="fail" />
</action>
<kill name="fail">
<message>Workflow failed, error message]</message>
</kill>
<end name="end" />
</workflow-app>
Workflow.xml保存在HDFS中,job.properties位于主节点中。 当使用命令&#34; oozie job -oozie http:/ /ip-xx-xx-xx-xx.ec2.internal:11000/oozie -config job.properties -run&#34;执行Oozie作业时, map-reduce程序启动了。没有启动spark作业,Mapreduce作业失败并出现错误。
1)对于Sparkmaster = yarn-cluster和mode = cluster,获得以下异常。
Log file: /mnt/yarn/usercache/hadoop/appcache/application_1502719828530_0011/container_1502719828530_0011_01_000001/spark-oozie-job_1502719828530_0011.log not present. Therefore no Hadoop job IDs found.
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SparkMain], main() threw exception, null
java.lang.NullPointerException
at java.io.File.<init>(File.java:277)
at org.apache.spark.deploy.yarn.Client.addDistributedUri$1(Client.scala:416)
at org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:454)
at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11$$anonfun$apply$6.apply(Client.scala:580)
at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11$$anonfun$apply$6.apply(Client.scala:579)
at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:74)
at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11.apply(Client.scala:579)
at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11.apply(Client.scala:578)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:578)
at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:814)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:169)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1091)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1150)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
at org.apache.oozie.action.hadoop.SparkMain.runSpark(SparkMain.java:340)
at org.apache.oozie.action.hadoop.SparkMain.run(SparkMain.java:259)
at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:60)
at org.apache.oozie.action.hadoop.SparkMain.main(SparkMain.java:80)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:234)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:455)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:344)
at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler.runSubtask(LocalContainerLauncher.java:380)
at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler.runTask(LocalContainerLauncher.java:301)
at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler.access$200(LocalContainerLauncher.java:187)
at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler$1.run(LocalContainerLauncher.java:230)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2)(master = yarn或master = local [*]或master = local [1]或master = local),mode =&#39; client&#39;得到如下错误。
Log file: /mnt/yarn/usercache/hadoop/appcache/application_1502719828530_0013/container_1502719828530_0013_01_000001/spark-oozie-job_1502719828530_0013.log not present. Therefore no Hadoop job IDs found.
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SparkMain], main() threw exception, No FileSystem for scheme: org.apache.spark
java.io.IOException: No FileSystem for scheme: org.apache.spark
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2708)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2715)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:93)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2751)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2733)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:377)
at org.apache.spark.deploy.SparkSubmit$.downloadFile(SparkSubmit.scala:865)
at org.apache.spark.deploy.SparkSubmit$$anonfun$downloadFileList$2.apply(SparkSubmit.scala:850)
at org.apache.spark.deploy.SparkSubmit$$anonfun$downloadFileList$2.apply(SparkSubmit.scala:850)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at org.apache.spark.deploy.SparkSubmit$.downloadFileList(SparkSubmit.scala:850)
at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$2.apply(SparkSubmit.scala:317)
at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$2.apply(SparkSubmit.scala:317)
at scala.Option.map(Option.scala:146)
at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:317)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
at org.apache.oozie.action.hadoop.SparkMain.runSpark(SparkMain.java:340)
at org.apache.oozie.action.hadoop.SparkMain.run(SparkMain.java:259)
at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:60)
at org.apache.oozie.action.hadoop.SparkMain.main(SparkMain.java:80)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:234)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:455)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:344)
at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler.runSubtask(LocalContainerLauncher.java:380)
at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler.runTask(LocalContainerLauncher.java:301)
at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler.access$200(LocalContainerLauncher.java:187)
at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler$1.run(LocalContainerLauncher.java:230)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
===============================
从链接https://issues.apache.org/jira/plugins/servlet/mobile#issue/OOZIE-2767,Oozie尚未支持spark2动作。
但根据链接https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_spark-component-guide/content/ch_oozie-spark-action.html,似乎有解决方法。
除了Hortonworks链接外,我还遵循了https://aws.amazon.com/blogs/big-data/use-apache-oozie-workflows-to-automate-apache-spark-jobs-and-more-on-amazon-emr/
中提到的所有步骤但到目前为止还没有运气。 我无法找到任何证明Oozie + Spark 2受支持或不受支持的文档。 如果它适用于任何人,请提供有关如何让Oozie + Spark2在AWS EMR中工作的详细步骤。