I have the following RDD.
[[1,101,001,100,product1],
[2,102,001,105,product2],
[3,103,002,101,product3]]
Expected output:
[('001', ['product1','100'],['product2','105']),('002',['product3','101'])]
Answer 0 (score: 0)
Feeling the holiday spirit, so here you go:
I assume that items 3 & 5 in your nested lists are meant to be strings...
Creating the RDD:
ls = [[1,101,"001",100,"product1"],
[2,102,"001",105,"product2"],
[3,103,"002",101,"product3"]]
rdd1 = sc.parallelize(ls)
This gives rdd1 as:
[[1, 101, '001', 100, 'product1'],
[2, 102, '001', 105, 'product2'],
[3, 103, '002', 101, 'product3']]
Mapping:
# discard items 1 & 2; set item 3 as key
rdd2 = rdd1.map(lambda row: (row[2], [row[4], row[3]]))
rdd2.collect()
> [('001', ['product1', 100]),
> ('001', ['product2', 105]),
> ('002', ['product3', 101])]
# group by key and map values to a list
rdd3 = rdd2.groupByKey().mapValues(list)
rdd3.collect()
> [('001', [['product1', 100], ['product2', 105]]),
> ('002', [['product3', 101]])]
This is not exactly the output you asked for, but the RDD is now keyed by item 3, which is the important part.
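If you do want the exact shape from the question (one tuple per key, with the amounts as strings), here is a minimal sketch of the remaining step. It runs in plain Python with no Spark context, so `group_by_key` is only a local stand-in for `groupByKey().mapValues(list)`, and `to_pair`, `group_by_key` and `to_expected_shape` are hypothetical helper names of mine, not part of PySpark.

```python
from collections import defaultdict

# Same rows as in the answer above, with items 3 and 5 as strings.
rows = [[1, 101, "001", 100, "product1"],
        [2, 102, "001", 105, "product2"],
        [3, 103, "002", 101, "product3"]]


def to_pair(row):
    # Same logic as the lambda passed to rdd1.map: item 3 becomes the key.
    return (row[2], [row[4], row[3]])


def group_by_key(pairs):
    # Local stand-in for groupByKey().mapValues(list); sorted by key
    # only to make the result deterministic here.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def to_expected_shape(key, values):
    # Flatten (key, [[name, amount], ...]) into (key, [name, 'amount'], ...).
    return (key, *[[name, str(amount)] for name, amount in values])


grouped = group_by_key(to_pair(row) for row in rows)
result = [to_expected_shape(key, values) for key, values in grouped]
print(result)
```

On the real RDD the last step would presumably be something like `rdd3.map(lambda kv: to_expected_shape(kv[0], kv[1]))`. Keep in mind that `groupByKey` does not promise any particular order of the values within a group, so the product order may differ from the one shown in the question.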