Combining multiple fields of a Pair RDD in a single pass

Time: 2016-04-20 08:51:14

Tags: apache-spark

Suppose I have the following RDD defined:

baseRDD= sc.parallelize( [(0,{'id':1, 'fld1':2.0, 'fld2':3.0}),
                          (0,{'id':2, 'fld1':4.0, 'fld2':5.0}),
                          (1,{'id':1, 'fld1':6.0, 'fld2':10.0}),
                          (1,{'id':2, 'fld1':10.0, 'fld2':15.0}),
                          (1,{'id':3, 'fld1':20.0, 'fld2':25.0})])

I want to combine (sum) the fields above by key to produce this RDD:

[(0,6.0,8.0),(1,36.0,50.0)]

I know I can do this field by field, as follows:

fld1RDD = baseRDD.map(lambda x: (x[0],x[1]['fld1'])).\
          reduceByKey(lambda x,y: (x+y))
fld2RDD = baseRDD.map(lambda x: (x[0],x[1]['fld2'])).\
          reduceByKey(lambda x,y: (x+y))

and then

  fld1RDD.join(fld2RDD).collect()

produces

  [(0, (6.0, 8.0)), (1, (36.0, 50.0))]

But is there a more efficient way to do this, so that the code doesn't have to make multiple passes over baseRDD?

1 Answer:

Answer 0: (score: 1)

You can always convert the data into a structure that can be aggregated directly, such as a Counter:

from collections import Counter
from operator import add

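# mergeValue receives the raw dict, so wrap it in a Counter before adding
baseRDD.combineByKey(Counter, lambda acc, x: acc + Counter(x), add).collect()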
## [(0, Counter({'fld1': 6.0, 'fld2': 8.0, 'id': 3})),
##  (1, Counter({'fld1': 36.0, 'fld2': 50.0, 'id': 6}))]
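Two properties of Counter are worth keeping in mind with this approach: `+` is only defined between two Counters, and it discards entries whose total is not positive. The following is a minimal plain-Python sketch (no Spark needed) that illustrates both; the variable names are only for illustration.

```python
from collections import Counter

a = Counter({'id': 1, 'fld1': 2.0, 'fld2': 3.0})
b = {'id': 2, 'fld1': 4.0, 'fld2': 5.0}   # a plain dict, as stored in the RDD

# Counter + Counter sums matching keys
total = a + Counter(b)

# Counter + dict is not supported and raises TypeError
try:
    a + b
    mixed_ok = True
except TypeError:
    mixed_ok = False

# '+' drops entries whose total is zero or negative ...
dropped = Counter({'fld1': 2.0}) + Counter({'fld1': -2.0})

# ... whereas update() adds in place and keeps them
kept = Counter({'fld1': 2.0})
kept.update({'fld1': -2.0})
```

If the fields can legitimately sum to zero or below, an in-place `update` (or one of the other structures) may be the safer choice.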

The same approach works with NumPy arrays:

from operator import itemgetter
import numpy as np

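def to_array(x):
    return np.array(itemgetter("fld1", "fld2")(x))

# mergeValue receives the raw dict, so convert it before adding
baseRDD.combineByKey(to_array, lambda acc, x: acc + to_array(x), add).collect()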
## [(0, array([ 6.,  8.])), (1, array([ 36.,  50.]))]
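If no extra dependency is wanted, plain tuples can serve as the accumulator too. The sketch below is not part of the original answer: `ZERO`, `seq_op` and `comb_op` are hypothetical names, intended for PySpark's `aggregateByKey(zeroValue, seqFunc, combFunc)`. The Spark call itself is shown only as a comment, and the two functions are exercised locally on two pretend partitions holding the records for key 1.

```python
from functools import reduce

ZERO = (0.0, 0.0)  # running (sum of fld1, sum of fld2)

def seq_op(acc, row):
    """Fold one record dict into the running sums (within a partition)."""
    return (acc[0] + row['fld1'], acc[1] + row['fld2'])

def comb_op(a, b):
    """Merge two partial sums (across partitions)."""
    return (a[0] + b[0], a[1] + b[1])

# With a live SparkContext this would be used roughly as:
#   baseRDD.aggregateByKey(ZERO, seq_op, comb_op) \
#          .map(lambda kv: (kv[0],) + kv[1]).collect()

# Local check: the three records for key 1, split over two pretend partitions
part1 = [{'id': 1, 'fld1': 6.0, 'fld2': 10.0},
         {'id': 2, 'fld1': 10.0, 'fld2': 15.0}]
part2 = [{'id': 3, 'fld1': 20.0, 'fld2': 25.0}]

partials = [reduce(seq_op, part, ZERO) for part in (part1, part2)]
result = (1,) + reduce(comb_op, partials)   # (1, 36.0, 50.0)
```

The final `map` in the commented Spark call flattens `(key, (sum1, sum2))` into the `(key, sum1, sum2)` shape asked for in the question.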