Pyspark - summarize and aggregate based on key in an RDD

Date: 2017-12-27 15:59:48

Tags: pyspark aggregate rdd

I have the following RDD.

[[1,101,001,100,product1],
 [2,102,001,105,product2],
 [3,103,002,101,product3]]

Expected output:

[('001', ['product1','100'],['product2','105']),('002',['product3','101'])]

1 Answer:

Answer 0 (score: 0)

Feeling the festive spirit, so here you go:

I assume items 3 & 5 in your nested lists are meant to be strings...

Create the RDD:

ls = [[1,101,"001",100,"product1"],
 [2,102,"001",105,"product2"],
 [3,103,"002",101,"product3"]]

rdd1 = sc.parallelize(ls)

This gives rdd1 as:

[[1, 101, '001', 100, 'product1'],
 [2, 102, '001', 105, 'product2'],
 [3, 103, '002', 101, 'product3']]
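(sc above is assumed to be an existing SparkContext, like the one the pyspark shell gives you. If you're running a standalone script instead, a minimal setup sketch would be something like this; the app name "rdd-group-example" is arbitrary:)

from pyspark import SparkContext

# only needed outside the pyspark shell, where sc already exists
sc = SparkContext("local", "rdd-group-example")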

Map:

# discard items 1 & 2; set item 3 as key
rdd2 = rdd1.map(lambda row: (row[2], [row[4], row[3]]))
rdd2.collect() 

> [('001', ['product1', 100]),
>  ('001', ['product2', 105]),
>  ('002', ['product3', 101])]

# group by key and map values to a list
rdd3 = rdd2.groupByKey().mapValues(list)
rdd3.collect()

> [('001', [['product1', 100], ['product2', 105]]), 
>  ('002', [['product3', 101]])]
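As a side note (not part of the original answer): the same grouping can also be sketched with reduceByKey, which combines values per key on the map side instead of shuffling every record to its group. Element order in the collected output may differ:

# wrap each value in a list, then concatenate the lists per key
rdd3_alt = rdd2.mapValues(lambda v: [v]).reduceByKey(lambda a, b: a + b)
rdd3_alt.collect()

# > [('001', [['product1', 100], ['product2', 105]]),
# >  ('002', [['product3', 101]])]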

This isn't exactly the output you asked for, but the RDD is now keyed the way you want..
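If you want to match the shape you posted exactly (quantities as strings, and each key's pairs spread out in one tuple), a sketch along these lines, building on rdd3 above, should get you there:

# stringify the quantity, then flatten each key's pairs into a single tuple
rdd4 = rdd3.mapValues(lambda pairs: [[name, str(qty)] for name, qty in pairs])
rdd4.map(lambda kv: tuple([kv[0]] + kv[1])).collect()

# > [('001', ['product1', '100'], ['product2', '105']),
# >  ('002', ['product3', '101'])]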