I know how to use MapReduce on files where each element is on its own line, but I'm trying to use it on a file whose entries look like this:
74390,0,6,7,5,2,6,4,10,12,7,6,12,9,4,3,9,1,3,5,9,9,8,5,12,11,4,8,5,9,6,12,12,9,7,9,12,7,8,9,8,8
74391,1,4,2,9,3,5,12,7,6,9,6,8,9,10,12,7,9,9,9,9,5,1,8,4,5,12,6,5,4,3,9,6,8,7,12,11,12,7,8,12,8
74392,0,6,9,3,2,4,9,1,4,7,12,9,12,12,10,6,9,9,5,12,7,12,7,6,8,7,9,5,3,5,9,8,9,12,5,8,4,11,8,6,8
74393,0,8,9,9,7,12,7,12,12,2,9,7,10,7,9,9,9,9,6,4,9,5,6,4,8,8,5,3,5,6,4,1,12,8,12,12,3,8,6,11,5
74394,0,5,9,6,2,4,6,5,6,7,12,8,9,7,9,10,3,9,1,9,8,9,12,7,3,5,12,12,4,12,4,8,9,5,9,12,8,11,6,8,7
74395,1,7,6,7,6,5,2,9,7,1,7,9,12,6,3,9,3,12,10,12,9,9,8,4,12,4,9,6,8,4,9,5,8,12,11,12,8,5,9,8,5
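The first entry is the index, and the second entry is meaningless for this analysis, so I drop it in the code below. My file has hundreds of thousands of lines like this, and I need to figure out which piece appears most often in each position of the line, since the positions correspond to slots. Expected output: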
0: 1
1: 11
2: 5
...
40: 9
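To pin down what "most frequent piece per slot" means, here is a small in-memory sketch (not MapReduce) that could serve as a reference on a sample of the file. The function name `top_piece_per_slot` is made up for illustration, and the sample rows are the six lines above truncated to their first three slots for brevity.

```python
from collections import Counter


def top_piece_per_slot(lines):
    """Return {slot: most frequent piece} for comma-separated lines."""
    # Drop the index and the unused second field from every line.
    rows = [line.strip().split(',')[2:] for line in lines]
    # zip(*rows) transposes the rows into columns, one column per slot.
    return {
        slot: Counter(column).most_common(1)[0][0]
        for slot, column in enumerate(zip(*rows))
    }


# The six sample lines, cut down to index, flag, and the first three slots.
sample = [
    "74390,0,6,7,5",
    "74391,1,4,2,9",
    "74392,0,6,9,3",
    "74393,0,8,9,9",
    "74394,0,5,9,6",
    "74395,1,7,6,7",
]

for slot, piece in top_piece_per_slot(sample).items():
    print(f"{slot}: {piece}")
```

This only works while the data fits in memory, so it is useful mainly for checking the MapReduce output on a small sample.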
What I have so far:
from mrjob.job import MRJob
from mrjob.step import MRStep

class topPieceSlot(MRJob):
    def mapper(self, _, line):
        pieces = line.split(',')
        pieces = pieces[2::]
        for item in range(len(pieces)):
            yield str(item)

    def reducer(self, pieces):
        for slot in range(len(pieces)):
            element = str(slot)
            numElements = 0
            for x in pieces:
                total += x
                numElements += 1
            yield element, numElements

if __name__ == '__main__':
    topPieceSlot.run()
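It returns nothing. It tells me that it needs more than one value to unpack, but I'm not sure why it only gets one value, or whether I'm even starting this correctly. Should I use 40 variables? That seems inefficient and wrong.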
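One likely cause of the unpack error is that the mapper yields a single string (`str(item)`), while a MapReduce framework expects a `(key, value)` pair from each `yield`. The reducer also needs to accept a key together with the values grouped under it. The sketch below shows that shape in plain Python, with no separate variable per slot: the slot number is the key, so the framework does the grouping. The mapper and reducer are written as free functions so the logic can be run without mrjob, and `run_locally` is a hypothetical stand-in for the framework's shuffle step, not part of any library.

```python
from collections import Counter, defaultdict


def mapper(line):
    """Emit one (slot, piece) pair per position, skipping index and flag."""
    pieces = line.strip().split(',')[2:]
    for slot, piece in enumerate(pieces):
        yield slot, int(piece)


def reducer(slot, pieces):
    """Emit the most frequent piece among all pieces seen in this slot."""
    counts = Counter(pieces)
    top_piece, _ = counts.most_common(1)[0]
    yield slot, top_piece


def run_locally(lines):
    """Hypothetical helper: group mapper output by key, then reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for slot, piece in mapper(line):
            grouped[slot].append(piece)
    result = {}
    for slot in sorted(grouped):
        for key, value in reducer(slot, grouped[slot]):
            result[key] = value
    return result


if __name__ == '__main__':
    # Tiny made-up input: index, unused flag, then two slots.
    print(run_locally(["1,0,5,7", "2,1,5,8", "3,0,6,8"]))
```

In an mrjob job the same bodies would go into `mapper(self, _, line)` and `reducer(self, slot, pieces)` on the `MRJob` subclass. Ties between equally frequent pieces are not handled specially here.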