Retrieving properties from an Oozie workflow with Spark

Asked: 2015-12-09 20:38:20

Tags: apache-spark oozie

I am using Spark 1.3.0 and Oozie 4.1.0.

I have defined an Oozie workflow with a Spark action, shown below (trimmed slightly for readability).

<action name="sparkler-kicker">
    <spark xmlns="uri:oozie:spark-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <property>
                <name>mapred.output.dir</name>
                <value>/user/oozie/output/</value>
            </property>
        </configuration>
        <master>yarn-cluster</master>
        <mode>cluster</mode>
        <name>sparkler-kicker</name>
        <class>com.sparklet.SparkClientCount</class>
        <jar>${nameNode}/apps/${JobName}/${JobStack}/lib/${JobName}.jar</jar>
        <spark-opts>... more here ...</spark-opts>
        <arg>...args here...</arg>
    </spark>
    <ok to="mark-job-end"/>
    <error to="mark-job-fail"/>
</action>

I want my Spark driver to write its output to the path supplied in <configuration>, i.e. the property named mapred.output.dir. Can my Spark driver read those properties programmatically? I can't seem to access them through either SparkConf or the JavaSparkContext.hadoopConfiguration() object. In the other documentation I have found, almost every Spark program passes such values via <arg>...</arg>; I haven't found any example that reads a <property> from <configuration>.
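For reference, this is roughly the kind of lookup I have been attempting inside the driver (an illustrative sketch only; the class name is made up, and neither call appears to see the property from the Oozie action):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ConfigLookupSketch {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setAppName("sparkler-kicker");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        // Neither of these returns the value set in the Oozie <configuration> block.
        String fromSparkConf = sparkConf.get("mapred.output.dir", null);
        String fromHadoopConf = sc.hadoopConfiguration().get("mapred.output.dir");

        System.out.println(fromSparkConf + " / " + fromHadoopConf);
        sc.stop();
    }
}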

1 Answer:

Answer 0 (score: 0)

I'm still in the process of learning Oozie myself, but I believe you can do it this way:

In your job.properties file you could define the output dir as a variable:

outputDir=/user/oozie/output

Then in your workflow, refer to that variable in <configuration> and also pass it in as an additional <arg> to your Spark app.

<action name="sparkler-kicker">
    <spark xmlns="uri:oozie:spark-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <property>
                <name>mapred.output.dir</name>
                <value>${outputDir}</value>
            </property>
        </configuration>
        <master>yarn-cluster</master>
        <mode>cluster</mode>
        <name>sparkler-kicker</name>
        <class>com.sparklet.SparkClientCount</class>
        <jar>${nameNode}/apps/${JobName}/${JobStack}/lib/${JobName}.jar</jar>
        <spark-opts>... more here ...</spark-opts>
        <arg>${outputDir}</arg>
        <arg>...some more args...</arg>
    </spark>
    <ok to="mark-job-end"/>
    <error to="mark-job-fail"/>
</action>
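On the Spark side, those <arg> values arrive as ordinary main() arguments, so the driver can just read the output directory from args. A minimal sketch (not your actual SparkClientCount, and the argument position is only an assumption here):

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkClientCount {
    public static void main(String[] args) {
        // Assumption: ${outputDir} is passed as the first <arg> in the workflow.
        String outputDir = args[0];

        SparkConf conf = new SparkConf().setAppName("sparkler-kicker");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Placeholder work; your real job builds its own RDD.
        JavaRDD<String> sample = sc.parallelize(Arrays.asList("a", "b", "c"));
        sample.saveAsTextFile(outputDir);

        sc.stop();
    }
}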

Update: here is an alternative to defining the variable in job.properties, as pointed out by @SamsonScharfrichter in his comment. Instead of putting outputDir in job.properties, you can define it in your workflow in a <parameters> element.

<parameters>
    <property>
        <name>outputDir</name>
        <value>/user/oozie/output</value>
    </property>
</parameters>
<action name="sparkler-kicker">
    <spark xmlns="uri:oozie:spark-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <property>
                <name>mapred.output.dir</name>
                <value>${outputDir}</value>
            </property>
        </configuration>
        <master>yarn-cluster</master>
        <mode>cluster</mode>
        <name>sparkler-kicker</name>
        <class>com.sparklet.SparkClientCount</class>
        <jar>${nameNode}/apps/${JobName}/${JobStack}/lib/${JobName}.jar</jar>
        <spark-opts>... more here ...</spark-opts>
        <arg>${outputDir}</arg>
        <arg>...some more args...</arg>
    </spark>
    <ok to="mark-job-end"/>
    <error to="mark-job-fail"/>
</action>
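One thing to keep in mind (based on my reading of the Oozie docs, so double-check): the <value> declared in <parameters> acts as a default, and it is only used when the parameter isn't supplied at submission time. Anything you set in job.properties (or pass on the command line when submitting the job) will take precedence over it.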