Spark job works with spark-submit but fails as an Oozie job

Asked: 2016-09-08 12:05:43

Tags: python apache-spark pyspark oozie

Using Spark 2.0.0 and Oozie 4.2.0. I'm trying to run a Spark job with Oozie and I'm getting this error:

  File "/mnt/yarn/usercache/hadoop/appcache/application_1473318730987_0107/container_1473318730987_0107_02_000001/pyspark.zip/pyspark/sql/context.py", line 481, in __init__
  File "/mnt/yarn/usercache/hadoop/appcache/application_1473318730987_0107/container_1473318730987_0107_02_000001/pyspark.zip/pyspark/sql/session.py", line 177, in getOrCreate
  File "/mnt/yarn/usercache/hadoop/appcache/application_1473318730987_0107/container_1473318730987_0107_02_000001/pyspark.zip/pyspark/sql/session.py", line 211, in __init__
TypeError: 'JavaPackage' object is not callable
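
Looking at the Spark source, the code behind that last frame is roughly the following (paraphrased from memory of Spark 2.0.0's pyspark/sql/session.py; the exact lines may differ):

# Paraphrased excerpt of SparkSession.__init__ (Spark 2.0.0); not verbatim.
class SparkSession(object):
    def __init__(self, sparkContext, jsparkSession=None):
        self._sc = sparkContext
        self._jsc = self._sc._jsc
        self._jvm = self._sc._jvm
        if jsparkSession is None:
            # If org.apache.spark.sql.SparkSession is not on the JVM
            # classpath, self._jvm.SparkSession is a py4j JavaPackage,
            # and calling it raises "'JavaPackage' object is not callable".
            jsparkSession = self._jvm.SparkSession(self._jsc.sc())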

So pyspark/sql/session.py is trying to instantiate sc._jvm.SparkSession, but that name doesn't resolve to a class, and the call fails. Everything works with spark-submit, so I wrote a simple script, get_session.py, to see what's different:

#!/usr/bin/env python

from pyspark import SparkContext
sc = SparkContext()
print "sc._jvm.SparkSession:", sc._jvm.SparkSession

When run with spark-submit:

$ spark-submit --master yarn --deploy-mode cluster get_session.py
...
sc._jvm.SparkSession: <py4j.java_gateway.JavaClass object at 0x7f7e8194f850>
...
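
As a further check (a diagnostic sketch one could add to get_session.py; sc is the SparkContext defined above), it's possible to print which Spark jars the launcher actually put on the JVM classpath and compare the two launch paths:

# Diagnostic sketch: inspect the classpath of the JVM that py4j talks to.
# java.lang.System.getProperty is standard Java, reached through py4j.
cp = sc._jvm.java.lang.System.getProperty("java.class.path")
for entry in cp.split(":"):
    if "spark" in entry:
        print entry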

When called from the Oozie workflow:

<workflow-app name="testing" xmlns="uri:oozie:workflow:0.4">
  <start to="initSystem"/>

  <action name="initSystem">
    <spark xmlns="uri:oozie:spark-action:0.1">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration> 
        <property>
          <name>oozie.launcher.yarn.app.mapreduce.am.env</name>
          <value>SPARK_HOME=/usr/lib/spark/</value>
        </property>
        <property>  
          <name>oozie.launcher.mapred.child.env</name> 
          <value>PYSPARK_ARCHIVES_PATH=pyspark.zip</value>
        </property>
      </configuration>
      <master>yarn</master>
      <mode>cluster</mode>
      <name>testing</name>
      <class></class>
      <jar>${workflowPath}/get_session.py</jar>
      <spark-opts>--py-files py4j-src.zip,pyspark.zip</spark-opts>
    </spark>
    <ok to="end"/>
    <error to="end"/>
  </action>

  <end name="end"/>
</workflow-app>
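
A job.properties along these lines would drive this workflow (a sketch only; every value below is a placeholder, not taken from the actual cluster):

# Hypothetical job.properties; all values are placeholders.
nameNode=hdfs://<namenode-host>:8020
jobTracker=<resourcemanager-host>:8032
workflowPath=${nameNode}/user/hadoop/workflows/testing
oozie.wf.application.path=${workflowPath}
# Makes the Oozie sharelib (including its Spark jars) visible to actions:
oozie.use.system.libpath=true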

When the workflow runs, the output is:

sc._jvm.SparkSession: <py4j.java_gateway.JavaPackage object at 0x7fc8eb1f8b50>

Note that sc._jvm.SparkSession is a py4j.java_gateway.JavaClass in the first case (that's fine), but a py4j.java_gateway.JavaPackage in the second; that's bad, because a JavaPackage is the generic object py4j returns when the requested name can't be resolved.
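
The distinction can be checked explicitly with py4j's types (a small sketch along the lines of get_session.py, assuming sc is an existing SparkContext):

from py4j.java_gateway import JavaClass, JavaPackage

# py4j returns a JavaPackage for any dotted name it cannot resolve;
# the name only becomes a JavaClass when the class is on the JVM classpath.
obj = sc._jvm.org.apache.spark.sql.SparkSession
if isinstance(obj, JavaClass):
    print "SparkSession resolved to a real JVM class"
elif isinstance(obj, JavaPackage):
    print "SparkSession is not on the JVM classpath"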

Any ideas? All of this works fine with Spark 1.6.0, but SparkSession doesn't exist there.

0 Answers:

No answers yet.