I can't understand why standard Python code produces unexpected results when converted to MapReduce with mrjob.
Sample data from a .txt file:
1 12
1 14
1 15
1 16
1 18
1 12
2 11
2 11
2 13
3 12
3 15
3 11
3 10
This code builds a dictionary and performs a simple division:
dic = {}
with open('numbers.txt', 'r') as fi:
    for line in fi:
        parts = line.split()
        dic.setdefault(parts[0], []).append(int(parts[1]))
print(dic)
for k, v in dic.items():
    print(k, 1 / len(v), v)
Result:
{'1': [12, 14, 15, 16, 18, 12], '2': [11, 11, 13], '3': [12, 15, 11, 10]}
1 0.16666666666666666 [12, 14, 15, 16, 18, 12]
2 0.3333333333333333 [11, 11, 13]
3 0.25 [12, 15, 11, 10]
But when I convert it to MapReduce with mrjob:
from mrjob.job import MRJob
from mrjob.step import MRStep
from collections import defaultdict

class test(MRJob):
    def steps(self):
        return [MRStep(mapper=self.divided_vals)]

    def divided_vals(self, _, line):
        dic = {}
        parts = line.split()
        dic.setdefault(parts[0], []).append(int(parts[1]))
        for k, v in dic.items():
            yield (k, 1 / len(v)), v

if __name__ == '__main__':
    test.run()
Result:
["2", 1.0] [11]
["2", 1.0] [13]
["3", 1.0] [12]
["3", 1.0] [15]
["3", 1.0] [11]
["3", 1.0] [10]
["1", 1.0] [12]
["1", 1.0] [14]
["1", 1.0] [15]
["1", 1.0] [16]
["1", 1.0] [18]
["1", 1.0] [12]
["2", 1.0] [11]
Why doesn't MapReduce group the keys and compute the division the same way? How can I reproduce the standard Python result in MapReduce?
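For reference, here is my mental model of the result I want, written as a plain-Python simulation of the map → shuffle → reduce phases (the `mapper`, `shuffle`, and `reducer` functions below are my own illustrative names, not mrjob's API):

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Each mapper call sees exactly ONE line, so a dict built inside
    # it can only ever hold a single key with a single value.
    key, value = line.split()
    yield key, int(value)

def shuffle(pairs):
    # The framework, not the mapper, groups values by key between
    # the map and reduce phases.
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [v for _, v in group]

def reducer(key, values):
    # Grouping is complete here, so len(values) is the real count.
    yield key, 1 / len(values), values

lines = ["1 12", "1 14", "1 15", "2 11"]
pairs = [kv for line in lines for kv in mapper(line)]
for key, values in shuffle(pairs):
    for out in reducer(key, values):
        print(out)
# prints:
# ('1', 0.3333333333333333, [12, 14, 15])
# ('2', 1.0, [11])
```

Is the fix simply that the `1/len(v)` computation belongs in a reducer rather than in the mapper?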