Suppose I have defined the following RDD:
baseRDD = sc.parallelize([
    (0, {'id': 1, 'fld1': 2.0, 'fld2': 3.0}),
    (0, {'id': 2, 'fld1': 4.0, 'fld2': 5.0}),
    (1, {'id': 1, 'fld1': 6.0, 'fld2': 10.0}),
    (1, {'id': 2, 'fld1': 10.0, 'fld2': 15.0}),
    (1, {'id': 3, 'fld1': 20.0, 'fld2': 25.0})])
I want to sum the fields above by key to produce this RDD:
[(0,6.0,8.0),(1,36.0,50.0)]
I know I can do it one field at a time, as follows:
fld1RDD = (baseRDD.map(lambda x: (x[0], x[1]['fld1']))
           .reduceByKey(lambda x, y: x + y))
fld2RDD = (baseRDD.map(lambda x: (x[0], x[1]['fld2']))
           .reduceByKey(lambda x, y: x + y))
Then
fld1RDD.join(fld2RDD).collect()
produces
[(0, (6.0, 8.0)), (1, (36.0, 50.0))]
But is there a more efficient way to do this, so that the code does not have to make multiple passes over baseRDD?
Answer 0 (score: 1)
You can always convert the data into a structure that can be aggregated directly, for example a Counter:
from collections import Counter
from operator import add
# mergeValue receives a plain dict, so wrap it in a Counter before adding
baseRDD.combineByKey(Counter, lambda acc, x: acc + Counter(x), add).collect()
## [(0, Counter({'fld1': 6.0, 'fld2': 8.0, 'id': 3})),
##  (1, Counter({'fld1': 36.0, 'fld2': 50.0, 'id': 6}))]
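To see what the three arguments to combineByKey are doing, the per-key logic can be traced in plain Python without a SparkContext. This is only a sketch: the names create_combiner, merge_value and merge_combiners are illustrative labels rather than anything from the answer, and the two lists simulate one way Spark might split the records for key 1 across partitions. It assumes each incoming record is wrapped in a Counter before adding, because Counter's + is only defined between two Counter objects.

```python
from collections import Counter
from operator import add

# Illustrative names for the three combineByKey arguments (assumed labels)
create_combiner = Counter                           # first value seen for a key
merge_value = lambda acc, rec: acc + Counter(rec)   # fold one more record into the accumulator
merge_combiners = add                               # combine accumulators from two partitions

# Records for key 1, split across two hypothetical partitions
part_a = [{'id': 1, 'fld1': 6.0, 'fld2': 10.0},
          {'id': 2, 'fld1': 10.0, 'fld2': 15.0}]
part_b = [{'id': 3, 'fld1': 20.0, 'fld2': 25.0}]

acc_a = merge_value(create_combiner(part_a[0]), part_a[1])
acc_b = create_combiner(part_b[0])
total = merge_combiners(acc_a, acc_b)

print(total['fld1'], total['fld2'], total['id'])  # prints: 36.0 50.0 6
```

Note that Counter's + keeps only entries whose total is positive, so this approach fits data like the sample here, where every summed field is positive.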
or a NumPy array:
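from operator import add, itemgetter
import numpy as np
# Build a [fld1, fld2] array from one record dict
def to_array(x):
    return np.array(itemgetter("fld1", "fld2")(x))
# mergeValue receives a plain dict, so convert it before adding
baseRDD.combineByKey(to_array, lambda acc, x: acc + to_array(x), add).collect()
## [(0, array([ 6., 8.])), (1, array([ 36., 50.]))]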
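If the goal is exactly the flat (key, fld1_sum, fld2_sum) shape from the question, one possible single-pass variant is to map each record to a tuple and add the tuples element-wise. The sketch below is an illustration under assumptions rather than part of the answer: extract, add_pairs and sum_fields are hypothetical helper names, and sum_fields expects a pair RDD shaped like baseRDD.

```python
def extract(rec):
    """Pull the two fields to be summed out of one record dict."""
    return (rec['fld1'], rec['fld2'])

def add_pairs(a, b):
    """Element-wise sum of two (fld1, fld2) tuples."""
    return (a[0] + b[0], a[1] + b[1])

def sum_fields(rdd):
    """One pass over a pair RDD shaped like baseRDD; yields (key, fld1_sum, fld2_sum)."""
    return (rdd.mapValues(extract)
               .reduceByKey(add_pairs)
               .map(lambda kv: (kv[0],) + kv[1]))

# Local check of the helpers on the records for key 0, no Spark needed
rows = [{'id': 1, 'fld1': 2.0, 'fld2': 3.0},
        {'id': 2, 'fld1': 4.0, 'fld2': 5.0}]
print(add_pairs(extract(rows[0]), extract(rows[1])))  # prints: (6.0, 8.0)
```

With a live SparkContext, the call would be sum_fields(baseRDD).collect(); the order in which keys come back is not guaranteed.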