Question

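I'm looking for a Spark RDD operation like top or takeOrdered, but one that returns another RDD rather than an Array, i.e. one that does not collect the full result into RAM.

It can be a sequence of operations, but ideally no step should try to collect the full result into the memory of a single node.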

Answer 1

Suppose you want the top 50% of an RDD.

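import org.apache.spark.rdd.RDD

def top50(rdd: RDD[(Double, String)]): RDD[(Double, String)] = {
  // Sort by key, largest first, so the top records land in the lowest-numbered partitions.
  val sorted = rdd.sortByKey(ascending = false)
  val partitions = sorted.partitions.length
  // Throw away the contents of the lower partitions.
  sorted.mapPartitionsWithIndex { (pid, it) =>
    // Iterator.empty rather than Nil: the function must return an Iterator.
    if (pid <= partitions / 2) it else Iterator.empty
  }
}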

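This is an approximation: you may get more or less than 50%. You could do better, but it would cost an extra evaluation of the RDD. For the use cases I have in mind, that is not worth it.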
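For anyone who does need an exact cut, here is a minimal sketch of the "extra evaluation" variant mentioned above. It assumes the same RDD[(Double, String)] shape; topFraction and its fraction parameter are hypothetical names for illustration, not part of Spark. It counts the sorted RDD, then uses zipWithIndex to keep only the leading records, so the result is still an RDD and nothing is collected to the driver.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical helper (not a Spark API): keep the top `fraction` of records by key.
def topFraction(rdd: RDD[(Double, String)], fraction: Double): RDD[(Double, String)] = {
  // Cache the sorted RDD because it is evaluated more than once below.
  val sorted = rdd.sortByKey(ascending = false).cache()
  // Extra job: count the records to know how many to keep.
  val keep = (sorted.count() * fraction).toLong
  sorted
    .zipWithIndex()                            // pairs each record with its position in sorted order
    .filter { case (_, idx) => idx < keep }    // keep only the first `keep` positions
    .map { case (record, _) => record }        // drop the index again
}
```

The trade-off is the one already noted: the count and the index assignment each run a Spark job over the data, which may or may not be acceptable for a large RDD.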

Answer 2

Take a look at:

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala

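import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.rdd.MLPairRDDFunctions._

val rdd: RDD[(String, Int)] = ???  // your pair RDD: the String is the key, the Int is the value

// For each key, keep the n largest values (n: Int is chosen by you); the result is still an RDD.
val topByKey: RDD[(String, Array[Int])] = rdd.topByKey(n)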

Or use aggregate with a BoundedPriorityQueue.
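A rough sketch of the aggregate idea, in case it helps. As far as I know, Spark's own BoundedPriorityQueue is private to the spark package, so this uses a small hypothetical stand-in (BoundedTopN, not a Spark class) built on scala.collection.mutable.PriorityQueue. The "partitions" here are plain Scala collections that simulate what aggregate does: fold each partition into a bounded queue, then merge the queues.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a bounded priority queue:
// a min-heap that never holds more than `maxSize` elements.
final class BoundedTopN[T](maxSize: Int)(implicit ord: Ordering[T]) extends Serializable {
  // Reversed ordering, so `head` is the smallest element currently kept.
  private val heap = mutable.PriorityQueue.empty[T](ord.reverse)

  // Add one element, evicting the smallest kept element if the queue is full.
  def add(x: T): this.type = {
    if (heap.size < maxSize) heap.enqueue(x)
    else if (maxSize > 0 && ord.gt(x, heap.head)) { heap.dequeue(); heap.enqueue(x) }
    this
  }

  // Merge another queue into this one, still keeping at most `maxSize` elements.
  def merge(other: BoundedTopN[T]): this.type = {
    other.heap.foreach(add)
    this
  }

  // Kept elements, largest first.
  def toSortedList: List[T] = heap.toList.sorted(ord.reverse)
}

// Simulated partitions: fold each one (the seqOp step), then merge the results (the combOp step).
val partitions = Seq(Seq(5, 1, 9), Seq(7, 3), Seq(8, 2, 6))
val top3: List[Int] = partitions
  .map(p => p.foldLeft(new BoundedTopN[Int](3))(_ add _))
  .reduce(_ merge _)
  .toSortedList
```

With a real RDD the same two functions would be passed as rdd.aggregate(new BoundedTopN[Int](n))(_ add _, _ merge _). Note that aggregate returns its result to the driver, so this only suits a small n; if an RDD is needed afterwards, the n results would have to be parallelized again.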

Spark RDD operation like top that returns a smaller RDD
