Python mapreduce示例代码

时间:2015-09-04 05:21:25

标签: python hadoop mapreduce

我正在编写一个简单的MR程序来查找文件中包含单词“Private”的行数。地图阶段运行良好,但减少阶段连续失败。我在这里粘贴代码.... 映射器:

#!/usr/bin/env python

import sys

# input comes from STDIN (standard input) 
# the mapper will get number of records containing word "Private"
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    count = 0
    string = "Private"
    string = string.strip()
    count = count + 1
    # increase counters
    for word in words:
    if word == string:
        # write the results to STDOUT (standard output);
        # what we output here will be go through the shuffle proess and then 
        # be the input for the Reduce step, i.e. the input for reducer.py
                print '%s\t%s' % (string ,count) 

减速机:

#!/usr/bin/env python

from operator import itemgetter
import sys
current_sum = 0

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    string, count = line.split('\t', 1)

        try:
        count = float(count)
    except ValueError:
             continue

        current_sum = current_sum + count
            print '%s\t%s' % (string, current_sum)

当作业失败时,我收到以下消息

15/09/04 10:45:02 INFO client.RMProxy:在/0.0.0.0:8032连接到ResourceManager 15/09/04 10:45:02 INFO client.RMProxy:在/0.0.0.0:8032连接到ResourceManager 15/09/04 10:45:03 INFO mapred.FileInputFormat:要处理的总输入路径:1 15/09/04 10:45:03 INFO mapreduce.JobSubmitter:分裂数:2 15/09/04 10:45:03 INFO mapreduce.JobSubmitter:提交工作代币:job_1441341950773_0003 15/09/04 10:45:03 INFO impl.YarnClientImpl:提交的应用程序application_1441341950773_0003 15/09/04 10:45:03 INFO mapreduce.Job:跟踪工作的网址:http://meenal-Vostro-3546:8088/proxy/application_1441341950773_0003/ 15/09/04 10:45:03 INFO mapreduce.Job:正在运行的职位:job_1441341950773_0003 15/09/04 10:45:09 INFO mapreduce.Job:在uber模式下运行的job job_1441341950773_0003:false 15/09/04 10:45:09 INFO mapreduce.Job:地图0%减少0% 15/09/04 10:45:16 INFO mapreduce.Job:任务ID:attempt_1441341950773_0003_m_000001_0,状态:未通过 错误:java.lang.RuntimeException:PipeMapRed.waitOutputThreads():子进程失败,代码为1     在org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)     在org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)     在org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)     在org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)     在org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)     在org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)     在org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)     在org.apache.hadoop.mapred.YarnChild $ 2.run(YarnChild.java:163)     at java.security.AccessController.doPrivileged(Native Method)     在javax.security.auth.Subject.doAs(Subject.java:422)     在org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)     在org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

0 个答案:

没有答案