Ordering a (k, <tuple>) RDD

Time: 2018-05-06 16:00:20

Tags: python apache-spark pyspark rdd flatmap

I have an RDD of the following form:

rdd = sc.parallelize([(2, [199.99, 250.0, 129.99]),
(4, [49.98, 299.95, 150.0, 199.92]), 
(8, [179.97, 299.95, 199.92, 50.0]), 
(10, [199.99, 99.96, 129.99, 21.99, 199.99]), 
(12, [299.98, 100.0, 149.94, 499.95, 250.0])])

I need to flatten it into this form:

2,199.99
2,250.0
2,129.99
4,49.98
4,299.95
...

The result must also be sorted by either the first or the second field.

How can I do this?

Thanks.

1 answer:

Answer 0: (score: 0)

You can use flatMap to expand each record into (key, value) pairs, then sortBy to order them:

rdd = sc.parallelize([(2, [199.99, 250.0, 129.99]),
(4, [49.98, 299.95, 150.0, 199.92]), 
(8, [179.97, 299.95, 199.92, 50.0]), 
(10, [199.99, 99.96, 129.99, 21.99, 199.99]), 
(12, [299.98, 100.0, 149.94, 499.95, 250.0])])

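# Expand each (key, [values]) record into (key, value) pairs,
# then sort by key first and by value second.
print(rdd.flatMap(lambda x: [(x[0], y) for y in x[1]])
      .sortBy(lambda x: (x[0], x[1]))
      .collect())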
  

[(2, 129.99), (2, 199.99), (2, 250.0), (4, 49.98), (4, 150.0), (4, 199.92), (4, 299.95), (8, 50.0), (8, 179.97), (8, 199.92), (8, 299.95), (10, 21.99), (10, 99.96), (10, 129.99), (10, 199.99), (10, 199.99), (12, 100.0), (12, 149.94), (12, 250.0), (12, 299.98), (12, 499.95)]
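
To check the logic without a Spark context, here is a minimal plain-Python sketch of the same flatten-then-sort steps. It is not PySpark code: flatten_pairs is a hypothetical local stand-in for the flatMap call above, sorted stands in for sortBy, and only the first two records of the question's data are used. It also shows sorting by the second field and formatting each pair as a "key,value" line, as the question asks.

```python
# Plain-Python stand-in for the RDD pipeline (assumption: no Spark involved,
# the list below mimics the first two records of the question's data).
data = [
    (2, [199.99, 250.0, 129.99]),
    (4, [49.98, 299.95, 150.0, 199.92]),
]


def flatten_pairs(records):
    """Hypothetical helper: mirrors flatMap(lambda x: [(x[0], y) for y in x[1]])."""
    return [(key, value) for key, values in records for value in values]


pairs = flatten_pairs(data)

# Sort by the first field, then the second (like sortBy(lambda x: (x[0], x[1]))).
by_key_then_value = sorted(pairs)

# Sort by the second field, then the first (swap the tuple in the sort key).
by_value_then_key = sorted(pairs, key=lambda p: (p[1], p[0]))

# Format each pair as "key,value", the layout shown in the question.
lines = ["{},{}".format(k, v) for k, v in by_key_then_value]
print("\n".join(lines))
```

In the PySpark version, sorting by the second field should only require changing the sortBy key to lambda x: (x[1], x[0]). For key/value RDDs, rdd.flatMapValues(lambda v: v) is another way to produce the same (key, value) pairs.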