Using the Spark action

Date: 2015-09-18 08:37:46

Tags: apache-spark oozie pyspark hortonworks-data-platform

I have been trying to run a Python script on Spark (1.3.1.2.3), using Oozie to schedule the Spark job. The cluster is a 3-node HDP 2.3 installation set up with Ambari 2.1.1.

I am getting the following error when executing the job:

>>> Invoking Main class now >>>

Fetching child yarn jobs
tag id : oozie-9d5f396daac34b4a41fed946fac0472
Child yarn jobs are found - 
Spark Action Main class        : org.apache.spark.deploy.SparkSubmit

Oozie Spark action configuration
=================================================================

                    --master
                    yarn-client
                    --deploy-mode
                    client
                    --name
                    boxplot outlier
                    --class
                    /usr/hdp/current/spark-client/AnalyticsJar/boxplot_outlier.py
                    --executor-memory
                    1G
                    --driver-memory
                    1G
                    --executor-cores
                    4
                    --num-executors
                    2
                    --conf
                    spark.yarn.queue=default
                    --verbose
                    /usr/hdp/current/spark-client/AnalyticsJar/boxplot_outlier.py

=================================================================

>>> Invoking Spark class now >>>

Traceback (most recent call last):
  File "/usr/hdp/current/spark-client/AnalyticsJar/boxplot_outlier.py", line 129, in <module>
    main()
  File "/usr/hdp/current/spark-client/AnalyticsJar/boxplot_outlier.py", line 60, in main
    sc = SparkContext(conf=conf)
  File "/hadoop/yarn/local/filecache/1314/spark-core_2.10-1.1.0.jar/pyspark/context.py", line 107, in __init__
  File "/hadoop/yarn/local/filecache/1314/spark-core_2.10-1.1.0.jar/pyspark/context.py", line 155, in _do_init
  File "/hadoop/yarn/local/filecache/1314/spark-core_2.10-1.1.0.jar/pyspark/context.py", line 201, in _initialize_context
  File "/hadoop/yarn/local/filecache/1314/spark-core_2.10-1.1.0.jar/py4j/java_gateway.py", line 701, in __call__
  File "/hadoop/yarn/local/filecache/1314/spark-core_2.10-1.1.0.jar/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.SecurityException: class "javax.servlet.FilterRegistration"'s signer information does not match signer information of other classes in the same package
    at java.lang.ClassLoader.checkCerts(ClassLoader.java:895)
    at java.lang.ClassLoader.preDefineClass(ClassLoader.java:665)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:758)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at org.eclipse.jetty.servlet.ServletContextHandler.<init>(ServletContextHandler.java:136)
    at org.eclipse.jetty.servlet.ServletContextHandler.<init>(ServletContextHandler.java:129)
    at org.eclipse.jetty.servlet.ServletContextHandler.<init>(ServletContextHandler.java:98)
    at org.apache.spark.ui.JettyUtils$.createServletHandler(JettyUtils.scala:98)
    at org.apache.spark.ui.JettyUtils$.createServletHandler(JettyUtils.scala:89)
    at org.apache.spark.ui.WebUI.attachPage(WebUI.scala:67)
    at org.apache.spark.ui.WebUI$$anonfun$attachTab$1.apply(WebUI.scala:60)
    at org.apache.spark.ui.WebUI$$anonfun$attachTab$1.apply(WebUI.scala:60)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.ui.WebUI.attachTab(WebUI.scala:60)
    at org.apache.spark.ui.SparkUI.initialize(SparkUI.scala:66)
    at org.apache.spark.ui.SparkUI.<init>(SparkUI.scala:60)
    at org.apache.spark.ui.SparkUI.<init>(SparkUI.scala:42)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:223)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:53)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:214)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)

Intercepting System.exit(1)

<<< Invocation of Main class completed <<<

Here is my workflow.xml file:

<?xml version="1.0" encoding="UTF-8"?>
<workflow-app xmlns='uri:oozie:workflow:0.4' name='sparkjob'>
    <start to='spark-process' />
    <action name='spark-process'>
        <spark xmlns='uri:oozie:spark-action:0.1'>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <property>
                <name>oozie.launcher.mapred.job.queue.name</name>
                <value>launcher2</value>
            </property>
            <property>
                <name>oozie.service.SparkConfigurationService.spark.configurations</name>
                <value>spark.eventLog.dir=hdfs://node1.analytics.tardis:8020/user/spark/applicationHistory,spark.yarn.historyServer.address=http://node1.analytics.tardis:18088,spark.eventLog.enabled=true</value>
            </property>
        </configuration>
        <master>yarn-client</master>
        <mode>client</mode>
        <name>boxplot outlier</name>
        <class>/usr/hdp/current/spark-client/AnalyticsJar/boxplot_outlier.py</class>
        <jar>/usr/hdp/current/spark-client/AnalyticsJar/boxplot_outlier.py</jar>
        <spark-opts>--executor-memory 1G --driver-memory 1G --executor-cores 4 --num-executors 2 --conf spark.yarn.queue=default</spark-opts>
        </spark>
        <ok to='end'/>
        <error to='spark-fail'/>
    </action>
    <kill name='spark-fail'>
        <message>Spark job failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>

    <end name='end' />
</workflow-app>
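
The workflow is parameterized with ${jobTracker} and ${nameNode}, which are supplied through job.properties. For reference, a minimal job.properties for this workflow would look roughly like the following (the application path is illustrative, and 8050 is the HDP default ResourceManager port):

    nameNode=hdfs://node1.analytics.tardis:8020
    jobTracker=node1.analytics.tardis:8050
    oozie.use.system.libpath=true
    oozie.wf.application.path=${nameNode}/user/ambari-qa/sparkjob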

From some initial searching, this error seems to appear when the jar packaging the Spark job code carries conflicting dependencies. The Python script boxplot_outlier.py, however, does not import any dependency that could cause such a conflict.

I need some guidance here; any suggestions would be greatly appreciated.

Edit: I checked the Classpath element of the Oozie Java/Map-Reduce/Pig action launcher-job configuration, and it includes the following two jars:

/hadoop/yarn/local/usercache/ambari-qa/appcache/application_1441804290161_0903/container_e03_1441804290161_0903_01_000002/mr-framework/hadoop/share/hadoop/common/lib/servlet-api-2.5.jar

/hadoop/yarn/local/usercache/ambari-qa/appcache/application_1441804290161_0903/container_e03_1441804290161_0903_01_000002/javax.servlet-3.0.0.v201112011016.jar

From the discussion on SPARK-1693, it looks like these two jars could be causing this dependency conflict, although that issue was supposedly resolved in the 1.1.0 release. There may be a dependency problem with Hadoop 2.7, or some configuration I am missing. Any help would be appreciated.
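
A quick way to confirm which servlet jars the launcher is pulling from the sharelib is to list the spark sharelib (a sketch; the Oozie URL and the lib_<timestamp> directory name depend on the installation):

    # list the jars Oozie ships for the spark action
    oozie admin -oozie http://node1.analytics.tardis:11000/oozie -shareliblist spark

    # or inspect the sharelib directory in HDFS directly
    hdfs dfs -ls /user/oozie/share/lib/lib_*/spark | grep -i servlet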

2 Answers:

Answer 0 (score: 1)

Finally solved it. It turns out that removing javax.servlet-3.0.0.v201112011016.jar from the Oozie sharelib spark directory in HDFS mitigates the issue. I am not sure whether this is the right way to solve the problem, or whether it is a configuration issue in the HDP 2.3.0 distribution; I will report it to the HDP folks for further investigation.
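
Concretely, the removal amounts to something like this (a sketch; the Oozie URL and the lib_<timestamp> directory name vary per installation):

    # remove the Jetty-bundled servlet jar from the spark sharelib in HDFS
    hdfs dfs -rm /user/oozie/share/lib/lib_*/spark/javax.servlet-3.0.0.v201112011016.jar

    # refresh Oozie's view of the sharelib without a restart
    oozie admin -oozie http://node1.analytics.tardis:11000/oozie -sharelibupdate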

Answer 1 (score: 1)

Seeing the same problem on Cloudera CDH 5.5.2. I could not find any reference to this being a known issue, and deleting the jar from the sharelib seems like quite a hack.

To test the theory, I deleted javax.servlet-3.0.0.v201112011016.jar from the sharelib and did a sharelibupdate (otherwise Oozie complains that the file is missing), then added javax.servlet-api-3.1.0.jar to my own custom oozie.libpath (it could also go into the sharelib, but I did not want to do that), and the problem went away. There must be a better way, though.
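
After the same removal and sharelibupdate as above, the extra step here is staging the replacement jar under a custom libpath, roughly as follows (user and host names are placeholders):

    # stage the Servlet 3.1 API jar under a private libpath in HDFS
    hdfs dfs -mkdir -p /user/myuser/oozie-libs
    hdfs dfs -put javax.servlet-api-3.1.0.jar /user/myuser/oozie-libs/

and then point the job at it in job.properties:

    oozie.libpath=hdfs://namenode-host:8020/user/myuser/oozie-libs
    oozie.use.system.libpath=true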

Sharing it here anyway, just in case it helps someone.