Task: given a huge unsorted input dataset of type RDD[Int], return the top 10% as another RDD[Int].
Why is the output type RDD[Int] in the first place? Because the input is so big that even the top 10% does not fit into memory, which is why I cannot call
sc.makeRDD(input.top((0.1 * input.count()).toInt))
as the result of top() would be collected to the driver and exhaust its memory.
This problem is usually handled by sorting the entire input and then applying some kind of limit(). But that becomes really inefficient.
Is there an efficient alternative?
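For reference, the sort-then-limit baseline mentioned above could look roughly like the sketch below. It is only an illustration: the helper name topFractionBySort and its fraction parameter are hypothetical, and since RDD has no limit() that returns an RDD, the sketch uses zipWithIndex plus a filter on the rank instead.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical helper (not from the question): exact top fraction via a full sort.
// This is the costly baseline, since sortBy shuffles the entire dataset.
def topFractionBySort(input: RDD[Int], fraction: Double): RDD[Int] = {
  val k = (input.count() * fraction).toLong   // number of elements to keep
  input
    .sortBy(x => x, ascending = false)        // largest values first
    .zipWithIndex()                           // pair each value with its global rank (from 0)
    .filter { case (_, rank) => rank < k }    // keep only the first k ranks
    .keys                                     // drop the rank, back to RDD[Int]
}
```

The result stays distributed, so nothing is collected to the driver, but the full sort is exactly the cost the question wants to avoid.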
Answer 0 (score: 1)
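There's a DataFrame operation called approxQuantile that could work for you; it lets you specify the allowable error. Note that the top 10% starts at the 0.9 quantile, and that approxQuantile returns an Array[Double] of quantile values rather than a dataset, so use the result as a threshold and filter the original RDD (toDF needs import spark.implicits._):
val threshold = rdd.toDF("num").stat.approxQuantile("num", Array(0.9), 0.05).head
val top10 = rdd.filter(_ >= threshold)
Then top10 holds approximately your top 10%. The 0.05 is a relative error on rank, so the threshold falls roughly between the 85th and 95th percentiles.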
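Putting the idea together, a minimal sketch might estimate a cutoff with approxQuantile and then filter the original RDD, so the result stays distributed. The helper name approxTopFraction and its parameters are assumptions made for illustration; the Spark calls used are DataFrame.stat.approxQuantile and RDD.filter.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical helper (not from the answer): approximate top fraction without a full sort.
def approxTopFraction(spark: SparkSession, input: RDD[Int],
                      fraction: Double, relativeError: Double): RDD[Int] = {
  import spark.implicits._                     // enables input.toDF
  // The top `fraction` of the values starts at quantile (1 - fraction).
  // approxQuantile returns one Double per requested probability;
  // .head throws if the input is empty, so this assumes a non-empty RDD.
  val threshold = input.toDF("num").stat
    .approxQuantile("num", Array(1.0 - fraction), relativeError)
    .head
  // Second pass over the data: keep everything at or above the cutoff.
  input.filter(_ >= threshold)
}
```

Because the input is scanned twice (once for the quantile, once for the filter), caching it first may help if it is expensive to recompute. The output size is only approximately the requested fraction: the cutoff carries the rank error, and all values equal to the cutoff are kept.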