In the RDD below, I want to get the distinct values for each key.
rdd = sc.parallelize([('id1',['12','12','87']),('id2',['15','17']),
('id3',['20','23','23']),('id4',['20','23','24','26','26','26'])])
The dataset looks like this:
[('id1', ['12', '12', '87']),
('id2', ['15', '17']),
('id3', ['20', '23', '23']),
('id4', ['20', '23', '24', '26', '26', '26'])]
The expected result is:
[('id1', ['12','87']),
('id2', ['15', '17']),
('id3', ['20', '23']),
('id4', ['20', '23', '24', '26'])]
This is what I have so far, but it does not work. Please help.
rdd.flatMap(lambda x: x).keys().distinct()
How can I write code that achieves this? Thanks.
Answer 0 (score: 1)
rdd.mapValues(lambda x: set(x)).take(10)
[
('id1', set(['12', '87'])),
('id2', set(['15', '17'])),
('id3', set(['20', '23'])),
('id4', set(['24', '26', '20', '23']))
]
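Note that a set does not keep the original order of the values, while the expected result in the question is a list in first-seen order. If that order matters, one option is to deduplicate with dict keys instead. Below is a minimal sketch, assuming Python 3.7+ (where dicts keep insertion order); `distinct_in_order` is a hypothetical helper name, and a plain list stands in for the RDD contents so the snippet runs without Spark.

```python
def distinct_in_order(values):
    """Return the distinct items of `values`, keeping first-seen order."""
    # dict keys are unique and, on Python 3.7+, keep insertion order
    return list(dict.fromkeys(values))


# Local stand-in for the RDD contents from the question
data = [
    ('id1', ['12', '12', '87']),
    ('id2', ['15', '17']),
    ('id3', ['20', '23', '23']),
    ('id4', ['20', '23', '24', '26', '26', '26']),
]

# Apply the helper to each value list, leaving the keys untouched
result = [(key, distinct_in_order(vals)) for key, vals in data]
print(result)
```

With the RDD from the question, the same helper should be usable as `rdd.mapValues(distinct_in_order).collect()`, assuming a running SparkContext `sc`.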
Answer 1 (score: 0)
Please find the answer below. It is written in Scala; you can find similar APIs and functions in Python.
val rdd = sc.parallelize(Seq(("id1",("12","12","87")),("id2",("15","17")),("id3",("20","23","23")),("id4",("20","23","24","26","26","26"))))
rdd.foreach(println)
// output
//(id1,(12,12,87))
//(id4,(20,23,24,26,26,26))
//(id2,(15,17))
//(id3,(20,23,23))
rdd.mapValues(list => list.productIterator.toSet) // convert each tuple into a set
OR, to get a list of distinct values in their original order:
rdd.mapValues(list => list.productIterator.toList.distinct)
//(id1,Set(12, 87))
//(id3,Set(20, 23))
//(id2,Set(15, 17))
//(id4,Set(20, 23, 24, 26))