我有一个带有以下元组格式的RDD:
((a, (b,c)), (d, f, g))
我希望按(a, (b,c))
进行分组,并按d
汇总为:
如何在pySpark中完成多个键的分组,在这种情况下哪个函数更优,reduceByKey或aggregateByKey?
答案 0 :(得分:1)
在这种情况下,我连续2个字符串,但数字应该相同。 我注意到你跳过了“e”值
p=["a","b","c","d", "e","f","g"]
def trasforma(p,num):
l=list()
for i in range(0,num):
l.append([j+str(i) for j in p])
return l
x=sc.parallelize(trasforma(p,10)+trasforma(p,10)).map(lambda x: ((x[0], (x[1],x[2])), (x[3],x[5],x[6])))
x.reduceByKey(lambda x,y: (x[0]+y[0], x[1], x[2] )).collect()
--------OUTPUT--------
[(('a5', ('b5', 'c5')), ('d5d5', 'f5', 'g5')),
(('a8', ('b8', 'c8')), ('d8d8', 'f8', 'g8')),
(('a1', ('b1', 'c1')), ('d1d1', 'f1', 'g1')),
(('a0', ('b0', 'c0')), ('d0d0', 'f0', 'g0')),
(('a9', ('b9', 'c9')), ('d9d9', 'f9', 'g9')),
(('a7', ('b7', 'c7')), ('d7d7', 'f7', 'g7')),
(('a2', ('b2', 'c2')), ('d2d2', 'f2', 'g2')),
(('a3', ('b3', 'c3')), ('d3d3', 'f3', 'g3')),
(('a4', ('b4', 'c4')), ('d4d4', 'f4', 'g4')),
(('a6', ('b6', 'c6')), ('d6d6', 'f6', 'g6'))]