Efficient way to return top 10% of unsorted RDD as another RDD in Spark?

Asked: 2017-12-18 07:11:22

Tags: algorithm sorting apache-spark

Task: given some huge unsorted input dataset of RDD[Int], return top 10% as another RDD[Int].

Why is the output type RDD[Int] in the first place? Because the input is so large that even the top 10% does not fit into driver memory, which is why I cannot call

sc.makeRDD(input.top((0.1 * input.count()).toInt))

since top() collects its result to the driver, which would exhaust driver memory.

This problem is usually handled by sorting the entire input and then calling some kind of limit() (see the sketch after the list below). But this becomes really inefficient because

  • at least two passes over the entire dataset are made (one to get its size, another to sort it), and
  • I am only interested in the top 10%, not in the remaining 90%.
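
For reference, here is a minimal sketch of that sort-based baseline (assuming the input is named input: RDD[Int]; the variable names are illustrative):

val n = input.count()                     // pass 1: size of the dataset
val k = math.ceil(0.1 * n).toLong         // number of elements in the top 10%
val top10Sorted = input
  .sortBy(x => x, ascending = false)      // pass 2: full sort, shuffling the whole dataset
  .zipWithIndex()                         // attach a global rank (yet another job)
  .filter { case (_, rank) => rank < k }  // keep only the k highest-ranked elements
  .keys                                   // back to RDD[Int]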

Is there an efficient alternative?

1 answer:

Answer 0 (score: 1)

There's a DataFrame operation called approxQuantile (reachable via df.stat) that could work for you and lets you specify the allowable error.

val Array(cutoff) = rdd.toDF("num").stat.approxQuantile("num", Array(0.9), 0.05)
val top10 = rdd.filter(_ >= cutoff)

Everything in top10 above then belongs approximately to your top 10%, within the 5% relative error you allowed for the quantile estimate. (Note that the cutoff for the largest 10% of values is the 0.9 quantile.)
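
For completeness, a minimal self-contained sketch of this approach (assuming Spark 2.x or later; the names spark, rdd, and top10, and the parallelize stand-in for the real input, are illustrative):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("top10pct").getOrCreate()
import spark.implicits._  // needed for rdd.toDF

val rdd: RDD[Int] = spark.sparkContext.parallelize(1 to 1000000)  // stand-in for the huge input

// One job to estimate the 90th-percentile cutoff (0.05 relative error),
// then a single distributed filter; nothing large is collected to the driver.
val Array(cutoff) = rdd.toDF("num").stat.approxQuantile("num", Array(0.9), 0.05)
val top10: RDD[Int] = rdd.filter(_ >= cutoff)

This avoids the full sort entirely: only small quantile summaries ever reach the driver, and the result stays an RDD[Int].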