Question

我有以下形式的rdd：

rdd = sc.parallelize([(2, [199.99, 250.0, 129.99]),
(4, [49.98, 299.95, 150.0, 199.92]), 
(8, [179.97, 299.95, 199.92, 50.0]), 
(10, [199.99, 99.96, 129.99, 21.99, 199.99]), 
(12, [299.98, 100.0, 149.94, 499.95, 250.0])])

我需要将它扁平化为这种形式：

2,199.99
2,250.0
2,12.99
4,49.98
4.299.95
...

它也必须由第一个或第二个字段排序。

如何实现？

感谢。

Answer 1

您可以像这样使用flatMap：

rdd = sc.parallelize([(2, [199.99, 250.0, 129.99]),
(4, [49.98, 299.95, 150.0, 199.92]), 
(8, [179.97, 299.95, 199.92, 50.0]), 
(10, [199.99, 99.96, 129.99, 21.99, 199.99]), 
(12, [299.98, 100.0, 149.94, 499.95, 250.0])])

print rdd.flatMap(lambda x: [(x[0], y) for y in x[1]])\
.sortBy(lambda x: (x[0], x[1])).collect()

[（2,129.99），（2,199.99），（2,250.0），（4,49.98），（4,150.0），（4， 199.92），（4,299.95），（8,50.0），（8,179.97），（8,199.92），（8,299.95），（10,21.99），（10,99.96），（10,129.99），（10,199.99），（10,199.99），（12,100.0），（12,149.94），（12,250.0），（12,299.98），（12,499.95）]

订单（k，<元组>）RDD

1 个答案: