Hadoop streaming workflow with multiple files

Time: 2015-01-24 00:54:13

Tags: hadoop awk mapreduce

I am trying to write an Oozie workflow with a Hadoop streaming action that executes an awk program. Below is my scenario.

The Hadoop streaming command works fine from the client. However, when it is executed as an Oozie workflow it fails, because it cannot find the second file. Note that the awk script sits in my local home directory, which is also mounted on Hadoop, while the input paths are on HDFS. In sample.awk I use two variables, $1 and $2, which should pick up their data from file1 and file2.
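
sample.awk itself is not shown in the post; for reference, a minimal streaming-compatible awk mapper could look like the sketch below (hypothetical — it assumes the script simply emits the first two whitespace-separated fields of each record it reads on stdin):

#!/usr/bin/awk -f
# Hadoop streaming pipes each input record to this script on stdin;
# $1 and $2 are awk field variables, i.e. the first and second
# whitespace-separated fields of the current line.
{ print $1 "\t" $2 }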

Along with the CLI command, I have also attached the streaming workflow that I configured from Hue, which is not working as expected.



/usr/bin/hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.3.0-mr1-cdh5.1.0.jar -D mapreduce.job.reduces=0 -D mapred.reduce.tasks=0 -input /user/cloudera/input/file1 /user/cloudera/input/file2 -output /user/cloudera/awk/output -mapper /home/cloudera/diff_files/op_code/sample.awk -file /home/cloudera/diff_files/op_code/sample.awk
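
For reference, the streaming jar also accepts the -input option multiple times, one path per flag, which is the documented way to pass multiple inputs; a sketch of the equivalent command:

/usr/bin/hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.3.0-mr1-cdh5.1.0.jar \
    -D mapreduce.job.reduces=0 \
    -D mapred.reduce.tasks=0 \
    -input /user/cloudera/input/file1 \
    -input /user/cloudera/input/file2 \
    -output /user/cloudera/awk/output \
    -mapper /home/cloudera/diff_files/op_code/sample.awk \
    -file /home/cloudera/diff_files/op_code/sample.awk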




Workflow.xml
------------------


<workflow-app name="awk" xmlns="uri:oozie:workflow:0.4">
  <global>
            <configuration>
                <property>
                    <name></name>
                    <value></value>
                </property>
            </configuration>
  </global>
    <start to="awk-streaming"/>
    <action name="awk-streaming" cred="">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <streaming>
                <mapper>/home/clouderasample.awk</mapper>
                <reducer>/home/clouderasample.awk</reducer>
            </streaming>
            <configuration>
                <property>
                    <name>mapred.output.dir</name>
                    <value>/user/cloudera/awk/output</value>
                </property>
                <property>
                    <name>oozie.use.system.libpath</name>
                    <value>true</value>
                </property>
                <property>
                    <name>mapred.input.dir</name>
                    <value>/user/cloudera/awk/input</value>
                </property>
            </configuration>
            <file>/user/cloudera/awk/input/file1#file1</file>
            <file>/user/cloudera/awk/input/file2#file2</file>
        </map-reduce>
        <ok to="end"/>
        <error to="kill"/>
    </action>
    <kill name="kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>

1 Answer:

Answer 0 (score: 1)

Please see this link for more details: http://wiki.apache.org/hadoop/JobConfFile

mapred.input.dir accepts a comma-separated list of input paths, so list both files in a single value:

<property>
    <name>mapred.input.dir</name>
    <value>/user/cloudera/awk/input/file1,/user/cloudera/awk/input/file2</value>
    <description>A comma separated list of input directories.</description>
</property>