A very silly question... I have the following data:
id1, value
1, 20.2
1,20.4
....
I want to find the mean and median for each id1 (note: the mean and median per id, not the global mean and median). I am using Python with Hadoop Streaming.
mapper.py
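import os
import sys

for line in sys.stdin:
    try:
        # strip the trailing line separator
        line = line.rstrip(os.linesep)
        tokens = line.split(",")
        # emit "id,value" (comma-separated, as in the original job)
        print('%s,%s' % (tokens[0], tokens[1]))
    except Exception:
        # skip lines that do not have at least two fields
        continue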
reducer.py
import os
import sys
from collections import defaultdict

# every value seen by this reducer, grouped by id
data_dict = defaultdict(list)

def mean(data_list):
    return sum(data_list) / float(len(data_list)) if len(data_list) else 0

def median(mylist):
    sorts = sorted(mylist)
    length = len(sorts)
    if not length % 2:
        # even count: average the two middle values
        return (sorts[length // 2] + sorts[length // 2 - 1]) / 2.0
    return sorts[length // 2]

for line in sys.stdin:
    try:
        line = line.rstrip(os.linesep)
        serial_id, duration = line.split(",")
        data_dict[serial_id].append(float(duration))
    except Exception:
        # ignore lines that cannot be parsed
        pass

for k, v in data_dict.items():
    print("%s,%s,%s" % (k, mean(v), median(v)))
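I expected a single mean and median for each key, but I see the same id repeated with different means and medians. For example, grepping the output for one id gives: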
mean_median/part-00003:SH002616940000,5.0,5.0
mean_median/part-00008:SH002616940000,901.0,901.0
mean_median/part-00018:SH002616940000,11.0,11.0
mean_median/part-00000:SH002616940000,2.0,2.0
mean_median/part-00025:SH002616940000,1800.0,1800.0
mean_median/part-00002:SH002616940000,4.0,4.0
mean_median/part-00006:SH002616940000,8.0,8.0
mean_median/part-00021:SH002616940000,14.0,14.0
mean_median/part-00001:SH002616940000,3.0,3.0
mean_median/part-00022:SH002616940000,524.666666667,26.0
mean_median/part-00017:SH002616940000,65.0,65.0
mean_median/part-00016:SH002616940000,1384.0,1384.0
mean_median/part-00020:SH002616940000,596.0,68.0
mean_median/part-00014:SH002616940000,51.0,51.0
mean_median/part-00004:SH002616940000,6.0,6.0
mean_median/part-00005:SH002616940000,7.0,7.0
Any suggestions?
Answer 0 (score: 1)
I answered the same question on the hadoop-user mailing list, as follows:
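How many reducers did you start for this job? If you started many reducers, the job generates multiple output files named part-*. Each part file contains only the local mean and median computed over that reducer's partition.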
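To make the symptom concrete, here is a toy sketch in plain Python (not Hadoop itself). The sample records, the `stats_per_partition` helper, and both partitioning functions are made up for illustration. It only shows that when the values of one id are spread over several partitions, each partition reports its own mean and median, and none of them has to match the statistics over all values.

```python
from collections import defaultdict

def mean(values):
    return sum(values) / float(len(values)) if values else 0

def median(values):
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 0:
        # even count: average the two middle values
        return (ordered[n // 2 - 1] + ordered[n // 2]) / 2.0
    return ordered[n // 2]

def stats_per_partition(records, num_partitions, partition_of):
    """Group values by (partition, key) and summarise each group on its own."""
    groups = defaultdict(list)
    for key, value in records:
        part = partition_of(key, value) % num_partitions
        groups[(part, key)].append(value)
    return {pk: (mean(vs), median(vs)) for pk, vs in groups.items()}

# Made-up sample: a single id with four values.
records = [("A", 2.0), ("A", 4.0), ("A", 6.0), ("A", 100.0)]

# Hypothetical partitioner that depends on the value as well as the id,
# so the records of one id are scattered over two partitions.
scattered = stats_per_partition(records, 2, lambda k, v: int(v) // 5)

# Hypothetical partitioner that sends everything to partition 0,
# so all records of "A" stay together.
together = stats_per_partition(records, 2, lambda k, v: 0)

print(scattered)
print(together)
```

In the scattered case the id "A" appears once per partition with different numbers, which is the same pattern as the grep output in the question. Note also that the per-partition medians cannot be combined into the overall median afterwards; the values themselves have to end up in one place.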
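Two solutions:
1. Call setNumReduceTasks(1) to set the number of reducers to 1. The job then produces a single output file, with exactly one mean and median for each distinct key.
2. Refer to org.apache.hadoop.examples.WordMedian in the Hadoop source code. It processes all the output files generated by multiple reducers with a local function and produces the final result.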
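One more thing that may be worth checking alongside the reducer count. This is offered as an assumption about the default Streaming settings, not as a confirmed diagnosis of this job: by default Hadoop Streaming treats the text up to the first tab of each mapper output line as the key, and a line without a tab is taken as all key with an empty value. A mapper that prints `id,value` would then be partitioned on the whole line, so lines sharing an id could reach different reducers. The sketch below emits `id<TAB>value` instead and summarises key-sorted reducer input with `itertools.groupby`. The function names `map_lines` and `reduce_lines` are hypothetical, chosen only for this example.

```python
from itertools import groupby

def map_lines(lines):
    """Turn 'id,value' input lines into 'id<TAB>value' records."""
    for line in lines:
        tokens = line.strip().split(",")
        if len(tokens) != 2:
            continue  # skip malformed lines
        key, value = tokens[0].strip(), tokens[1].strip()
        try:
            float(value)  # also drops a header line such as "id1, value"
        except ValueError:
            continue
        yield "%s\t%s" % (key, value)

def median(values):
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 0:
        # even count: average the two middle values
        return (ordered[n // 2 - 1] + ordered[n // 2]) / 2.0
    return ordered[n // 2]

def reduce_lines(lines):
    """Summarise key-sorted 'id<TAB>value' records, one output line per id."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        values = [float(v) for _, v in group]
        yield "%s,%s,%s" % (key, sum(values) / len(values), median(values))

# Local dry run: sorted() stands in for the shuffle-and-sort step.
sample = ["id1, value\n", "1,20.0\n", "1,21.0\n", "2,5\n"]
for out in reduce_lines(sorted(map_lines(sample))):
    print(out)
```

In a real job each function would sit in its own small script that feeds it `sys.stdin` and prints what it yields. `groupby` relies on the reducer input arriving sorted by key, which Hadoop provides, and it keeps only one id's values in memory at a time. If changing the mapper is not convenient, the Streaming property `stream.map.output.field.separator` can tell the framework to split the key on a comma instead of a tab.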