Python scripts run locally, but fail when running the MapReduce code

Asked: 2015-02-14 07:22:09

Tags: python hadoop mapreduce

I have a mapper.py and a reducer.py that process an input file, which is just a plain Linux text file in the following format:

ID \t time \t duration \t Description \t status

Basically I want to group my records by ID in the reducer, so I built the mapper as follows:

#!/usr/bin/env python

import sys
import re

for line in sys.stdin:
    #remove leading and trailing whitespace
    line = line.strip()
    #split the line into portions
    portions = re.split(r'\t+',line)
    #take the first column (which is block number) to emit as key
    block = portions[0]
    print '%s\t%s\t%s\t%s\t%s' % (block,portions[1],portions[2],portions[3],portions[4])

Then in the reducer, I do the data processing as follows:

#!/usr/bin/env python

from operator import itemgetter
import sys

bitmapStr=""
current_block=None
block=start=duration=precision=status=""
round=0 #interval is every 11 mins or 660 seconds

for line in sys.stdin:
    line=line.strip()
    block,start,duration,precision,status=line.split('\t')

    if current_block == block:
            duration = int(duration)
            while round < duration:
                    if(status.islower()):
                            bitmapStr=bitmapStr+"1"
                    else:
                            bitmapStr=bitmapStr+"0"
                    round = round + 660

            #amount of time exceed this block record
            round = round - duration

    else:
            if current_block:
                    print '%s\t%s' % (current_block,bitmapStr)
            round=0
            bitmapStr=""
            current_block=block
            duration = int(duration)
            while round < duration:
                    if(status.islower()):
                            bitmapStr=bitmapStr+"1"
                    else:
                            bitmapStr=bitmapStr+"0"
                    round = round + 660
            #amount of time exceed this block record
            round = round - duration

if current_block == block:
    print '%s\t%s'  % (current_block,bitmapStr)
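The per-record loop in the reducer (append one bit per started 660-second interval, carrying the overshoot into the next record of the same block) can be factored into a small helper. This is a sketch of the same logic, not part of the original scripts:

```python
def extend_bitmap(bitmap, carry, duration, status, interval=660):
    """Append one bit per started `interval` of `duration` seconds,
    offset by `carry` seconds left over from the previous record.
    The bit is '1' if status is all lowercase, else '0'.
    Returns (new_bitmap, carry_into_next_record)."""
    bit = '1' if status.islower() else '0'
    t = carry
    while t < duration:
        bitmap += bit
        t += interval
    return bitmap, t - duration

# two full 660-second intervals, lowercase status -> two '1' bits, no carry
print(extend_bitmap('', 0, 1320, 'up'))   # ('11', 0)
```

A duration that is not a multiple of 660 leaves a positive carry, which is exactly what the `round = round - duration` line in the reducer propagates into the next record of the same block.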

I tested the mapper and the reducer locally with:

cat small_data_sample | ./mapper.py | sort -k1,1 | ./reducer.py
#output is working as I expect
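Since a local sample may not contain the lines that break in the cluster, a quick pre-check of the real input can locate rows that the mapper cannot split into five fields. The function name and sample rows below are hypothetical:

```python
import re

def find_bad_lines(lines, expected_fields=5):
    """Return (line_number, line) pairs whose tab-split field count
    is not expected_fields -- e.g. comment or header lines."""
    bad = []
    for lineno, line in enumerate(lines, 1):
        portions = re.split(r'\t+', line.strip())
        if len(portions) != expected_fields:
            bad.append((lineno, line))
    return bad

sample = [
    "# hypothetical comment line",
    "blk_1\t100\t660\tbackup\trunning",
]
print(find_bad_lines(sample))   # [(1, '# hypothetical comment line')]
```

Running this over the full input file (not just the sample) would have flagged the comment lines before the job was submitted.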

However, when I try to run it through Hadoop, it produces the following error:

Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:320)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)

The exact command used to run the Hadoop job is as follows:

bin/hadoop jar hadoop-streaming.jar \
-D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
-D mapred.text.key.partitioner.options='-k1,1'  \
-D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
-D mapred.text.key.comparator.options='-k1,1 -k2,2n' \
-D stream.num.map.output.key.fields=2 \
-input $hadoop_dir/data/sample \
-output $hadoop_dir/data/data_test1-output \
-mapper $dir/calBitmap_mapper.py \
-reducer $dir/calBitmap_reducer.py \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner

where $hadoop_dir is the path to my HDFS location, and $dir is the location of my mapper and reducer Python scripts.

Please let me know what is needed to fix the error. Thanks in advance!

*Edit: I tried a different, much smaller input file and it seems to work fine. So I don't know why MapReduce breaks for the large input file.
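One way to see which records fail on the larger file is to catch the parse error in the mapper and report the offending line on stderr, which ends up in the failed task's logs. This is a sketch in Python 3 syntax with an in-memory demo, not the original script:

```python
import io
import re

def map_stream(lines, out, err):
    """Emit one tab-joined record per well-formed line; report the rest."""
    for line in lines:
        portions = re.split(r'\t+', line.strip())
        try:
            record = '%s\t%s\t%s\t%s\t%s' % (portions[0], portions[1],
                                             portions[2], portions[3],
                                             portions[4])
        except IndexError:
            # visible in the task's stderr log instead of killing the task
            err.write('malformed line skipped: %r\n' % line)
            continue
        out.write(record + '\n')

# demo on in-memory streams; in a real mapper this would be
# map_stream(sys.stdin, sys.stdout, sys.stderr)
demo_out, demo_err = io.StringIO(), io.StringIO()
map_stream(["blk_1\t0\t660\tdesc\tok", "# stray comment"], demo_out, demo_err)
print(demo_out.getvalue(), end='')
```

In Hadoop Streaming, anything a mapper writes to stderr is kept in the per-task logs, so the bad lines can be found there after a run instead of only seeing the generic `subprocess failed with code 1`.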

1 Answer:

Answer 0 (score: 0)

I found the solution to the error. In the mapper, I was not paying attention to the different kinds of input. Some of my input files have a few comment lines at the top, so the portions array failed with an index out of bounds. To fix it, I added a check:

if len(portions) == 5: #make sure it has 5 elements in there
    print '%s\t%s\t%s\t%s\t%s' % (block,portions[1],portions[2],portions[3],portions[4])
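Putting that check into a complete mapper might look like this. This is a sketch in Python 3 syntax, not the poster's exact script:

```python
import re
import sys

def map_lines(lines, expected_fields=5):
    """Yield one tab-joined record per line that has exactly
    expected_fields tab-separated fields; silently skip the rest."""
    for line in lines:
        portions = re.split(r'\t+', line.strip())
        if len(portions) == expected_fields:   # skips comments/headers
            yield '\t'.join(portions)

if __name__ == '__main__':
    for record in map_lines(sys.stdin):
        print(record)
```

The guard makes the mapper robust to any line that does not match the five-field format, which is why the job succeeds on the large file once it is in place.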