I am using Spark 1.3.0 and Oozie 4.1.0.
I have defined an Oozie workflow for a Spark action as shown below (slightly trimmed for readability).
<action name="sparkler-kicker">
    <spark xmlns="uri:oozie:spark-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <property>
                <name>mapred.output.dir</name>
                <value>/user/oozie/output/</value>
            </property>
        </configuration>
        <master>yarn-cluster</master>
        <mode>cluster</mode>
        <name>sparkler-kicker</name>
        <class>com.sparklet.SparkClientCount</class>
        <jar>${nameNode}/apps/${JobName}/${JobStack}/lib/${JobName}.jar</jar>
        <spark-opts>... more here ...</spark-opts>
        <arg>...args here...</arg>
    </spark>
    <ok to="mark-job-end"/>
    <error to="mark-job-fail"/>
</action>
I would like my Spark driver to write its output to the path supplied in the <configuration> section, e.g. the property named mapred.output.dir. Can my Spark driver read these properties programmatically? I can't seem to access them through SparkConf or through the JavaSparkContext.hadoopConfiguration() object. From the other documentation I've found, almost every Spark program uses <arg>...</arg>: I haven't found any example of reading a <property> defined inside <configuration>.
Answer 0 (score: 0)
I'm also in the process of learning Oozie myself, but I believe you can do it this way:
In your job.properties file, you could define the output dir as a variable:
outputDir=/user/oozie/output
Then in your workflow, refer to that variable in <configuration> and also pass it in as an additional <arg> to your Spark app.
<action name="sparkler-kicker">
    <spark xmlns="uri:oozie:spark-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <property>
                <name>mapred.output.dir</name>
                <value>${outputDir}</value>
            </property>
        </configuration>
        <master>yarn-cluster</master>
        <mode>cluster</mode>
        <name>sparkler-kicker</name>
        <class>com.sparklet.SparkClientCount</class>
        <jar>${nameNode}/apps/${JobName}/${JobStack}/lib/${JobName}.jar</jar>
        <spark-opts>... more here ...</spark-opts>
        <arg>${outputDir}</arg>
        <arg>...some more args...</arg>
    </spark>
    <ok to="mark-job-end"/>
    <error to="mark-job-fail"/>
</action>
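Your driver can then pick the path up from its main arguments like any other parameter. Here is a minimal sketch of what that might look like; the body of SparkClientCount and the assumption that the output dir is the first argument are my own illustration, not taken from your question:

package com.sparklet;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkClientCount {
    public static void main(String[] args) {
        // Assumes the Oozie action passes ${outputDir} as the first <arg>.
        String outputDir = args[0];

        SparkConf conf = new SparkConf().setAppName("sparkler-kicker");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // ... build whatever RDD you actually need ...
        sc.parallelize(java.util.Arrays.asList("a", "b", "c"))
          .saveAsTextFile(outputDir);   // write to the directory Oozie resolved

        sc.stop();
    }
}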
Update: adding an alternative to defining the variables in job.properties, as pointed out by @SamsonScharfrichter in his comment. Instead of defining outputDir in job.properties, you can define it in your workflow in a parameters element.
<parameters>
    <property>
        <name>outputDir</name>
        <value>/user/oozie/output</value>
    </property>
</parameters>
<action name="sparkler-kicker">
    <spark xmlns="uri:oozie:spark-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <property>
                <name>mapred.output.dir</name>
                <value>${outputDir}</value>
            </property>
        </configuration>
        <master>yarn-cluster</master>
        <mode>cluster</mode>
        <name>sparkler-kicker</name>
        <class>com.sparklet.SparkClientCount</class>
        <jar>${nameNode}/apps/${JobName}/${JobStack}/lib/${JobName}.jar</jar>
        <spark-opts>... more here ...</spark-opts>
        <arg>${outputDir}</arg>
        <arg>...some more args...</arg>
    </spark>
    <ok to="mark-job-end"/>
    <error to="mark-job-fail"/>
</action>
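Note that, as far as I understand Oozie's parameter handling, the <value> inside <parameters> only acts as a default: a value supplied at submission time (in job.properties or with -D on the oozie command line) will still override it, so the two approaches can be combined.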