I have an RDD of the following form:
rdd = sc.parallelize([(2, [199.99, 250.0, 129.99]),
(4, [49.98, 299.95, 150.0, 199.92]),
(8, [179.97, 299.95, 199.92, 50.0]),
(10, [199.99, 99.96, 129.99, 21.99, 199.99]),
(12, [299.98, 100.0, 149.94, 499.95, 250.0])])
I need to flatten it into this form:
2,199.99
2,250.0
2,129.99
4,49.98
4,299.95
...
It also needs to be sorted by either the first or the second field.
How can I do this?
Thanks.
Answer 0 (score: 0)
You can use flatMap to emit one (key, value) pair per list element, then sortBy to order the result:
rdd = sc.parallelize([(2, [199.99, 250.0, 129.99]),
(4, [49.98, 299.95, 150.0, 199.92]),
(8, [179.97, 299.95, 199.92, 50.0]),
(10, [199.99, 99.96, 129.99, 21.99, 199.99]),
(12, [299.98, 100.0, 149.94, 499.95, 250.0])])
# Sorts by key, then by value; use lambda x: x[1] to sort by value only.
print(rdd.flatMap(lambda x: [(x[0], y) for y in x[1]])
         .sortBy(lambda x: (x[0], x[1])).collect())
[(2, 129.99), (2, 199.99), (2, 250.0), (4, 49.98), (4, 150.0), (4, 199.92), (4, 299.95), (8, 50.0), (8, 179.97), (8, 199.92), (8, 299.95), (10, 21.99), (10, 99.96), (10, 129.99), (10, 199.99), (10, 199.99), (12, 100.0), (12, 149.94), (12, 250.0), (12, 299.98), (12, 499.95)]
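If it helps to check the logic without a Spark context, here is a minimal plain-Python sketch of the same flatten-then-sort idea. The names `flatten`, `by_key` and `by_value` are made up for illustration and are not part of PySpark; the `sorted` calls only approximate what `sortBy(lambda x: x[0])` and `sortBy(lambda x: x[1])` would do on the RDD.

```python
from itertools import chain

data = [(2, [199.99, 250.0, 129.99]),
        (4, [49.98, 299.95, 150.0, 199.92]),
        (8, [179.97, 299.95, 199.92, 50.0]),
        (10, [199.99, 99.96, 129.99, 21.99, 199.99]),
        (12, [299.98, 100.0, 149.94, 499.95, 250.0])]


def flatten(pairs):
    """Local stand-in for rdd.flatMap(lambda x: [(x[0], y) for y in x[1]])."""
    return list(chain.from_iterable([(k, v) for v in vs] for k, vs in pairs))


flat = flatten(data)

# Sort by the first field only (the key).
by_key = sorted(flat, key=lambda x: x[0])

# Sort by the second field only (the value).
by_value = sorted(flat, key=lambda x: x[1])

# Print the first few rows in the "key,value" form from the question.
for k, v in by_key[:5]:
    print("%d,%s" % (k, v))
```

On the actual RDD, swapping the key function passed to sortBy should give the same choice between ordering by key and ordering by value.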