Python MapReduce: empty file from the mapper

Asked: 2015-11-08 13:03:55

Tags: python hadoop mapreduce

join2_mapper.py

#!/usr/bin/env python
import sys
shows = []
for line in sys.stdin:
    line       = line.strip()   
    key_value  = line.split(',')

    if key_value[1] == 'ABC':
        if key_value[1] not in shows:
            shows.append(key_value[0])

    if key_value[1].isdigit() and (key_value[0] in shows):
        print('{0}\t{1}'.format(key_value[0], key_value[1]) )

Sample input:

Hourly_Sports,DEF
Baked_Games,ABC
Dumb_Talking,ABC
Surreal_Talking,DEF
Cold_Sports,BAT
Hourly_Talking,XYZ
Baked_Talking,CNO
PostModern_Games,ABC
Loud_Talking,DEF
Almost_News,BAT
Hot_Talking,XYZ
Dumb_News,CNO
Surreal_News,ABC
Cold_Talking,DEF
Hourly_Show,BAT
Baked_Show,XYZ
PostModern_Talking,CNO
Loud_Show,ABC
Almost_Cooking,DEF
Hot_News,BAT
Dumb_Cooking,XYZ
Surreal_Cooking,CNO
Cold_News,ABC
Hourly_Sports,DEF
Baked_Sports,BAT
PostModern_Show,XYZ
Loud_Sports,CNO
Almost_Games,ABC
Hot_Cooking,DEF
Dumb_Games,BAT
Surreal_Games,XYZ
Cold_Cooking,CNO
Hourly_Talking,ABC
Baked_Talking,DEF
PostModern_Sports,BAT
Loud_Talking,XYZ
Almost_Talking,CNO
Hot_Games,ABC
Dumb_Talking,DEF
Surreal_Talking,BAT
Cold_Games,XYZ
Hourly_News,CNO
Baked_News,ABC
PostModern_Talking,DEF
Loud_News,BAT
Almost_Show,XYZ
Hot_Talking,CNO
Dumb_Show,ABC
Surreal_Show,DEF
Cold_Talking,BAT
Hourly_Cooking,XYZ
Baked_Cooking,CNO
PostModern_News,ABC
Loud_Cooking,DEF
Almost_Sports,BAT
Hot_Show,XYZ
Dumb_Sports,CNO
Surreal_Sports,ABC
Cold_Show,DEF
Hourly_Games,BAT
Baked_Games,XYZ
PostModern_Cooking,CNO
Loud_Games,ABC
Almost_Talking,DEF
Hot_Sports,BAT
Dumb_Talking,XYZ
Surreal_Talking,CNO
Cold_Sports,ABC
Hourly_Talking,DEF
Baked_Talking,BAT
PostModern_Games,XYZ
Loud_Talking,CNO
Almost_News,ABC
Hot_Talking,DEF
Dumb_News,BAT
Surreal_News,XYZ
Cold_Talking,CNO
Hourly_Show,ABC
Almost_Cooking,855
Baked_Games,991
Baked_News,579  
Baked_Games,200 
Baked_Games,533
Cold_News,590
Hourly_Show,896

$ cat j2.txt | python join2_mapper.py

Baked_Games 991
Baked_News  579
Baked_Games 200
Baked_Games 533
Cold_News   590
Hourly_Show 896

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-input /user/cloudera/join2_data/join2_genchan*.txt \
-input /user/cloudera/join2_data/join2_gennum*.txt \
-output /user/cloudera/join2_f1f \
-mapper /home/cloudera/join2_mapper.py \
-reducer /home/cloudera/join2_reducer.py \
-numReduceTasks 0

Map-Reduce Framework
        Map input records=6600
        Map output records=0
        Input split bytes=759
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=4419
        CPU time spent (ms)=9170
        Physical memory (bytes) snapshot=702300160
        Virtual memory (bytes) snapshot=9022578688
        Total committed heap usage (bytes)=364511232
    File Input Format Counters 
        Bytes Read=113055
    File Output Format Counters 
        Bytes Written=0

The problem is with the input files. I actually have six input files, as follows:

$ hdfs dfs -ls /user/cloudera/join2_data/join2_gen*.txt
-rw-r--r--   1 cloudera cloudera       1714 2015-11-07 12:24 /user/cloudera/join2_data/join2_genchanA.txt
-rw-r--r--   1 cloudera cloudera       3430 2015-11-07 12:24 /user/cloudera/join2_data/join2_genchanB.txt
-rw-r--r--   1 cloudera cloudera       5152 2015-11-07 12:24 /user/cloudera/join2_data/join2_genchanC.txt
-rw-r--r--   1 cloudera cloudera      17114 2015-11-07 12:24 /user/cloudera/join2_data/join2_gennumA.txt
-rw-r--r--   1 cloudera cloudera      34245 2015-11-07 12:24 /user/cloudera/join2_data/join2_gennumB.txt
-rw-r--r--   1 cloudera cloudera      51400 2015-11-07 12:24 /user/cloudera/join2_data/join2_gennumC.txt

When I concatenate all the files into a single file and run the job, it works and produces the desired result. When the six files are supplied as separate inputs, I get an empty output file. Please advise.
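
For reference, the symptom can be reproduced locally (a hypothetical check, using one of the gennum files from the listing below): if the mapper is fed only viewer-count lines, it prints nothing, because shows is only populated from lines whose second field is 'ABC':

$ hdfs dfs -get /user/cloudera/join2_data/join2_gennumA.txt
$ cat join2_gennumA.txt | python join2_mapper.py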

2 answers:

Answer 0 (score: 0)

Provide only one -input argument and pass it the path of the folder containing all the input data, rather than using a wildcard pattern. Also remove the reducer if you are not using it (just to get rid of clutter). I can't say for certain which of these will fix the problem (I suspect it's the first), but it will fix it. So:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-input /user/cloudera/join2_data/ \
-output /user/cloudera/join2_f1f \
-mapper /home/cloudera/join2_mapper.py

Answer 1 (score: 0)

Did you mean to use key_value[0] instead of key_value[1] in if key_value[1] not in shows?
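
If that is the intent, a minimal sketch of the corrected block (assuming the goal is to remember the names of shows aired on ABC) would be:

if key_value[1] == 'ABC':
    # track the show name (key_value[0]), not the channel string
    if key_value[0] not in shows:
        shows.append(key_value[0])

As written, the guard compares the literal channel string 'ABC' against a list of show names, so it never actually suppresses duplicates, although shows still gets populated.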