How to get distinct values in an RDD in PySpark

Date: 2019-08-29 14:49:37

Tags: apache-spark pyspark rdd

In the RDD below, I want to get the distinct values within each list:

  rdd = sc.parallelize([('id1', ['12', '12', '87']), ('id2', ['15', '17']),
                        ('id3', ['20', '23', '23']), ('id4', ['20', '23', '24', '26', '26', '26'])])

The dataset looks like this:

[('id1', ['12', '12', '87']),
 ('id2', ['15', '17']),
 ('id3', ['20', '23', '23']),
 ('id4', ['20', '23', '24', '26', '26', '26'])]

The expected result is:

[('id1', ['12', '87']),
 ('id2', ['15', '17']),
 ('id3', ['20', '23']),
 ('id4', ['20', '23', '24', '26'])]

Here is what I have, but it does not work correctly. Please help.

 rdd.flatMap(lambda x: x).keys().distinct()

How can I write the code to achieve this? Thanks.

2 Answers:

Answer 0 (score: 1)

rdd.mapValues(lambda x: set(x)).take(10)

[
('id1', set(['12', '87'])), 
('id2', set(['15', '17'])), 
('id3', set(['20', '23'])), 
('id4', set(['24', '26', '20', '23']))
]
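
Note that set(x) changes the value type from list to set, so the element order is no longer guaranteed (see id4 above). The question's own attempt fails for a different reason: flatMap(lambda x: x) flattens each pair into its key and its value list, so keys() no longer operates on key/value pairs. If the result must stay a list in first-seen order, here is a minimal sketch (not part of the original answer; it assumes Python 3.7+, where dict preserves insertion order):

  # Deduplicate each value list while keeping first-seen order;
  # dict.fromkeys keeps insertion order on Python 3.7+.
  rdd.mapValues(lambda vals: list(dict.fromkeys(vals))).collect()

  # [('id1', ['12', '87']),
  #  ('id2', ['15', '17']),
  #  ('id3', ['20', '23']),
  #  ('id4', ['20', '23', '24', '26'])]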

Answer 1 (score: 0)

Please find the answer below. It is written in Scala, but you can find similar APIs and functions in Python.

val rdd = sc.parallelize(Seq(
  ("id1", ("12", "12", "87")),
  ("id2", ("15", "17")),
  ("id3", ("20", "23", "23")),
  ("id4", ("20", "23", "24", "26", "26", "26"))))

rdd.foreach(println)

// output
//(id1,(12,12,87))
//(id4,(20,23,24,26,26,26))
//(id2,(15,17))
//(id3,(20,23,23))

rdd.mapValues(list => list.productIterator.toSet) // convert each value tuple into a Set, which drops duplicates

or, keeping the values as a List in their original order:

rdd.mapValues(list => list.productIterator.toList.distinct)

// output of the Set variant (the List variant prints List(...) instead):
//(id1,Set(12, 87))
//(id3,Set(20, 23))
//(id2,Set(15, 17))
//(id4,Set(20, 23, 24, 26))
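
For reference, here is a minimal PySpark equivalent of the same idea, applied to the list-valued RDD from the question (an illustrative sketch, not part of the original answer):

  # sorted(set(...)) deduplicates and returns a list; sorting makes the
  # output deterministic, since a Python set has no guaranteed order.
  rdd.mapValues(lambda vals: sorted(set(vals))).collect()

  # [('id1', ['12', '87']),
  #  ('id2', ['15', '17']),
  #  ('id3', ['20', '23']),
  #  ('id4', ['20', '23', '24', '26'])]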