How do I write a MapReduce job for the following?

Time: 2012-09-26 12:34:27

Tags: python hadoop mapreduce

I have the following data:

Name Year  score
A    1996  84
A    1997  65
A    1996  76
A    1998  78
A    1998  65
B    1998  53
B    1996  98
B    1996  83
B    1996  54

I want the output to look like this:

Name Year  max_score
A    1996  84
B    1996  98

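How do I write Python MapReduce code for this job?

I could use NAME and YEAR together as a single key, with the score as the value.

But is there another way to solve this problem?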
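To make the two keying options concrete, here is a minimal plain-Python sketch (no Hadoop involved) that contrasts them on the sample data. The names `ROWS`, `max_per_name_and_year`, and `max_per_name` are made up for illustration.

```python
# Sample rows from the question as (name, year, score) tuples.
ROWS = [
    ("A", 1996, 84), ("A", 1997, 65), ("A", 1996, 76),
    ("A", 1998, 78), ("A", 1998, 65), ("B", 1998, 53),
    ("B", 1996, 98), ("B", 1996, 83), ("B", 1996, 54),
]

def max_per_name_and_year(rows):
    """Composite key (name, year): keeps the best score for each pair."""
    best = {}
    for name, year, score in rows:
        key = (name, year)
        if key not in best or score > best[key]:
            best[key] = score
    return best

def max_per_name(rows):
    """Key on name only: keeps the (year, score) of each name's best score."""
    best = {}
    for name, year, score in rows:
        if name not in best or score > best[name][1]:
            best[name] = (year, score)
    return best

by_pair = max_per_name_and_year(ROWS)
by_name = max_per_name(ROWS)
```

On this data the composite key yields five entries (one per name and year), while the expected output has a single row per name. Keying on the name alone and carrying the year along with the score therefore looks closer to what is wanted.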

2 answers:

Answer 0: (score: 2)

Assuming all of your years and scores are positive numbers:

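from collections import defaultdict

# Maps name -> (year, score) for the best score seen so far.
# The (0, 0) default is why the scores must be positive.
mapping = defaultdict(lambda: (0, 0))
with open(datafile) as f:  # datafile: path to the input file
    for line in f:
        try:
            name, year, score = line.split()
            year = int(year)
            score = int(score)
        except ValueError:
            continue  # skip the header and any malformed row

        if score > mapping[name][1]:
            mapping[name] = year, score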

Or slightly more concise, but less robust against bad input:

from collections import defaultdict

mapping = defaultdict(lambda: (0, 0))
with open(datafile) as f:
    f.readline()  # header; not needed
    for line in f:
        name, year, score = line.split()
        if int(score) > mapping[name][1]:
            mapping[name] = int(year), int(score)
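Both versions depend on the `(0, 0)` default, which is where the "positive" assumption comes from. As a rough sketch, a plain dict with an explicit membership check avoids that assumption, and a small formatter prints the table in the layout asked for. `best_per_name`, `format_table`, and `SAMPLE` are illustrative names, not part of the answer above.

```python
def best_per_name(lines):
    """Return {name: (year, score)} for each name's highest score."""
    best = {}
    for line in lines:
        try:
            name, year, score = line.split()
            year, score = int(year), int(score)
        except ValueError:
            continue  # header, blank line, or malformed row
        # No sentinel default, so zero or negative scores also work.
        if name not in best or score > best[name][1]:
            best[name] = (year, score)
    return best

def format_table(best):
    """Render the result in the layout shown in the question."""
    rows = ["Name Year  max_score"]
    for name in sorted(best):
        year, score = best[name]
        rows.append("{:<4} {:<5} {}".format(name, year, score))
    return rows

# The sample data from the question, header included.
SAMPLE = """\
Name Year  score
A    1996  84
A    1997  65
A    1996  76
A    1998  78
A    1998  65
B    1998  53
B    1996  98
B    1996  83
B    1996  54
"""

if __name__ == "__main__":
    for row in format_table(best_per_name(SAMPLE.splitlines())):
        print(row)
```

When two rows tie for the best score, this keeps the first one encountered, which matches the strict `>` comparison used in the answer.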

Answer 1: (score: 0)

Is this what you are after?

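import operator

def mapper(key, value):
    name, year, score = value.split()
    if score.isdigit():  # skip the header row
        # Emit ints so the reducer compares scores numerically, not as strings.
        yield name, (int(year), int(score))

def reducer(name, values):
    # values: every (year, score) pair emitted for this name
    yield name, max(values, key=operator.itemgetter(1))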
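The mapper and reducer above assume a framework that groups values by key for you. If the job is run through Hadoop Streaming instead (an assumption, since the question does not say which framework is used), both steps exchange plain text lines, and the reducer receives its input sorted by key rather than pre-grouped. A minimal sketch of that style, with the sort step simulated locally by `sorted`; `map_lines`, `reduce_lines`, and `SAMPLE` are illustrative names:

```python
from itertools import groupby

def map_lines(lines):
    """Mapper: turn 'name year score' rows into 'name<TAB>year<TAB>score'."""
    for line in lines:
        parts = line.split()
        if len(parts) != 3 or not parts[2].isdigit():
            continue  # skip the header, blank lines, and malformed rows
        yield "\t".join(parts)

def reduce_lines(lines):
    """Reducer: input must be sorted by name; keep each name's best row."""
    records = (line.rstrip("\n").split("\t") for line in lines)
    for name, group in groupby(records, key=lambda rec: rec[0]):
        _, year, score = max(group, key=lambda rec: int(rec[2]))
        yield "\t".join((name, year, score))

# The sample data from the question, header included.
SAMPLE = """\
Name Year  score
A    1996  84
A    1997  65
A    1996  76
A    1998  78
A    1998  65
B    1998  53
B    1996  98
B    1996  83
B    1996  54
"""

# Local stand-in for the map -> sort -> reduce pipeline.
mapped = list(map_lines(SAMPLE.splitlines()))
result = list(reduce_lines(sorted(mapped)))
```

In an actual streaming job, each function would live in its own script that reads `sys.stdin` and prints the yielded lines. The `groupby` call only works because the lines for a given name arrive next to each other after sorting.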