我正在尝试编写一个带有执行awk程序的hadoop流式操作的工作流程,下面是我的方案
Hadoop流命令在客户端工作正常。但是,当执行oozie工作流时,它无法正常工作,因为它无法找到第二个文件。请注意,awk脚本位于本地主目录中,该目录也安装在hadoop上,输入路径位于HDFS上 在sample.awk(下面附带的代码)中,我传递两个变量$ 1和$ 2,它们应该从file1和file2获取数据
/usr/bin/hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.3.0-mr1-cdh5.1.0.jar -D mapreduce.job.reduces=0 -D mapred.reduce.tasks=0 -input /user/cloudera/input/file1 /user/cloudera/input/file2 -output /user/cloudera/awk/ouput -mapper /home/cloudera/diff_files/op_code/sample.awk -file /home/cloudera/diff_files/op_code/sample.awk

Workflow.xml
------------------
<workflow-app name="awk" xmlns="uri:oozie:workflow:0.4">
<global>
<configuration>
<property>
<name></name>
<value></value>
</property>
</configuration>
</global>
<start to="awk-streaming"/>
<action name="awk-streaming" cred="">
<map-reduce>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<streaming>
<mapper>/home/clouderasample.awk</mapper>
<reducer>/home/clouderasample.awk</reducer>
</streaming>
<configuration>
<property>
<name>mapred.output.dir</name>
<value>/user/cloudera/awk/output</value>
</property>
<property>
<name>oozie.use.system.libpath</name>
<value>true</value>
</property>
<property>
<name>mapred.input.dir</name>
<value>/user/cloudera/awk/input</value>
</property>
</configuration>
<file>/user/cloudera/awk/input/file1#file1</file>
<file>/user/cloudera/awk/input/file2#file2</file>
</map-reduce>
<ok to="end"/>
<error to="kill"/>
</action>
<kill name="kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
答案 0 :(得分:1)
请查看此链接了解更多详情 http://wiki.apache.org/hadoop/JobConfFile
<property>
<name>mapred.input.dir</name>
<value>/user/cloudera/awk/input/file1,/user/cloudera/awk/input/file2</value>
<description>A comma separated list of input directories.</description>
</property>