Question

Following this guide, I ran the sample exercise successfully. But when I run my own MapReduce job, I get the following error: ERROR streaming.StreamJob: Job not Successful! 10/12/16 17:13:38 INFO streaming.StreamJob: killJob... Streaming Job Failed!
Error in the log file:

java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:311)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:545)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:132)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)

Mapper.py

import sys

i=0

for line in sys.stdin:
    i+=1
    count={}
    for word in line.strip().split():
        count[word]=count.get(word,0)+1
    for word,weight in count.items():
        print '%s\t%s:%s' % (word,str(i),str(weight))

Reducer.py

import sys

keymap={}
o_tweet="2323"
id_list=[]
for line in sys.stdin:
    tweet,tw=line.strip().split()
    #print tweet,o_tweet,tweet_id,id_list
    tweet_id,w=tw.split(':')
    w=int(w)
    if tweet.__eq__(o_tweet):
        for i,wt in id_list:
            print '%s:%s\t%s' % (tweet_id,i,str(w+wt))
        id_list.append((tweet_id,w))
    else:
        id_list=[(tweet_id,w)]
        o_tweet=tweet

[Edit] Command used to run the job:

hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-0.20.0-streaming.jar -file /home/hadoop/mapper.py -mapper /home/hadoop/mapper.py -file /home/hadoop/reducer.py -reducer /home/hadoop/reducer.py -input my-input/* -output my-output

The input is any arbitrary sequence of sentences.

Thanks,

Answer 1

Your -mapper and -reducer should just be the script names.

hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-0.20.0-streaming.jar -file /home/hadoop/mapper.py -mapper mapper.py -file /home/hadoop/reducer.py -reducer reducer.py -input my-input/* -output my-output

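Files passed with -file are shipped with the job and end up in the working directory of each task attempt, so the scripts are found relative to "." rather than at their original local path. (FYI, if you ever want to add another -file, such as a lookup table, you can open it in Python as if it were in the same directory as your scripts while they run in the M/R job.)

Also make sure you have run chmod a+x mapper.py and chmod a+x reducer.py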
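
To illustrate the lookup-table remark, here is a minimal sketch of a mapper helper that opens a side file by its bare name. The file name lookup.txt, its tab-separated layout, and the function names are assumptions made for the example, not something from the question. It is written for Python 3, whereas the scripts in the question use Python 2 print statements.

```python
LOOKUP_FILE = "lookup.txt"  # hypothetical side file shipped with an extra -file option


def load_lookup(path=LOOKUP_FILE):
    """Read tab-separated key/value pairs from a file in the working directory."""
    table = {}
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            key, _, value = line.partition("\t")
            table[key] = value
    return table


def map_lines(lines, table):
    """Yield 'word<TAB>label' for every word that has an entry in the table."""
    for line in lines:
        for word in line.split():
            if word in table:
                yield "%s\t%s" % (word, table[word])
```

In an actual mapper you would loop over map_lines(sys.stdin, load_lookup()) and print each result. The relative path works only because the file sits next to the script in the task's working directory.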

Answer 2

Try adding

#!/usr/bin/env python

to the top of your script.

Or, invoke the interpreter explicitly:

-mapper 'python m.py' -reducer 'python r.py'
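
If you pass the bare script name to -mapper, the script needs both a shebang line and the execute bit. A small sketch of a pre-flight check you could run before submitting; the helper name check_streaming_script is made up for this example and is not part of Hadoop.

```python
import os
import stat


def check_streaming_script(path):
    """Return a list of problems that would stop `path` from being run directly."""
    problems = []
    # A directly executed script must begin with "#!" in its very first bytes.
    with open(path, "rb") as handle:
        first_line = handle.readline()
    if not first_line.startswith(b"#!"):
        problems.append("missing shebang line")
    # chmod a+x sets the execute bit for user, group and other; require at least one.
    mode = os.stat(path).st_mode
    if not mode & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH):
        problems.append("not executable")
    return problems
```

An empty list means the script passes both checks. Neither check matters if you run the script as 'python mapper.py', since the interpreter is then named explicitly.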

Answer 3

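I ran into this error recently, and my problem turned out to be something just as obvious (in hindsight) as the other solutions here:

I simply had a bug in my Python code. (In my case, I was using Python v2.7 string formatting, while the AWS EMR cluster I was using ran Python v2.6.)

To find the actual Python error, go to the Job Tracker web UI (for AWS EMR, port 9100 for AMI 2.x and port 9026 for AMI 3.x); find the failed mapper; open its logs; and read the stderr output.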
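
It can also be quicker to catch a plain Python bug before the job reaches the cluster. Below is a rough sketch that imitates map, sort, reduce on one machine and surfaces stderr when a stage fails. The function names are invented for the example, it runs the scripts with whatever interpreter executes the sketch (Python 3.7 or later for capture_output), and sorting whole lines only approximates Hadoop's sort by key.

```python
import subprocess
import sys


def run_stage(script_path, input_text):
    """Run one streaming script locally, feeding input_text on stdin."""
    result = subprocess.run(
        [sys.executable, script_path],
        input=input_text,
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout, result.stderr


def run_job_locally(mapper_path, reducer_path, input_text):
    """Imitate map -> sort -> reduce and raise with stderr if a stage fails."""
    code, mapped, err = run_stage(mapper_path, input_text)
    if code != 0:
        raise RuntimeError("mapper exited with code %d:\n%s" % (code, err))
    # Whole-line sort stands in for the shuffle phase (an approximation).
    shuffled = "".join(sorted(mapped.splitlines(keepends=True)))
    code, reduced, err = run_stage(reducer_path, shuffled)
    if code != 0:
        raise RuntimeError("reducer exited with code %d:\n%s" % (code, err))
    return reduced
```

A version mismatch like the one described above will not show up this way unless the local interpreter matches the one on the cluster, so the task's stderr log remains the authoritative place to look.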

Answer 4

Make sure your input directory contains only the correct files.

Answer 5

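I had the same problem. I tried Marvin W's solution, and I also installed Spark. Make sure you have installed Spark itself, the full framework and not just pyspark (the dependency): installation tutorial

Follow that tutorial.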

Answer 6

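You need to state explicitly that the mapper and reducer are to be run as Python scripts, since streaming accepts several kinds of commands. You can use either single or double quotes.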

-mapper "python mapper.py" -reducer "python reducer.py"

or

-mapper 'python mapper.py' -reducer 'python reducer.py'

The full command looks like this:

hadoop jar /path/to/hadoop-mapreduce/hadoop-streaming.jar \
-input /path/to/input \
-output /path/to/output \
-mapper 'python mapper.py' \
-reducer 'python reducer.py' \
-file /path/to/mapper-script/mapper.py \
-file /path/to/reducer-script/reducer.py

Answer 7

If you are running this command on a Hadoop cluster, make sure Python is installed on every NodeManager instance. #hadoop
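
One way to check a node is to look the interpreter up on PATH. This is only a sketch: the helper name find_interpreter is made up for the example, it has to be run on each node in turn, and the script itself needs some Python to run, so it is mainly useful for checking that the specific name you pass to -mapper (for example "python") resolves.

```python
import shutil


def find_interpreter(name="python"):
    """Return the full path of `name` on this machine's PATH, or None if absent."""
    return shutil.which(name)
```

A result of None for the name used in your -mapper command means the task would fail on that node before any of your code runs.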

Hadoop Streaming Job failed error in Python
