How do you sort an RDD in Scala Spark?

Asked: 2014-05-23 21:33:00

Tags: scala apache-spark rdd

Reading the docs for the Spark method sortByKey:

sortByKey([ascending], [numTasks])   When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.

Is it possible to return only the first "N" results? So rather than returning everything, just return the top 10. I could convert the sorted collection to an Array and use the take method, but since that is an O(N) operation, is there a more efficient way?
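For context, a minimal sketch of the approach described above (the local-mode setup and sample data are assumptions, not part of the question): sort the whole RDD, collect it to the driver, then keep the first 10.

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.SparkContext._ // pair-RDD implicits (needed on older Spark versions)

  val sc = new SparkContext(new SparkConf().setAppName("sort-example").setMaster("local[*]"))
  val myRdd = sc.parallelize(Seq(5 -> "e", 1 -> "a", 3 -> "c", 2 -> "b", 4 -> "d"))

  // Naive approach: sort everything, ship the entire sorted dataset to the
  // driver with collect(), then discard all but the first 10 -- the O(N)
  // cost the question wants to avoid.
  val first10 = myRdd.sortByKey().collect().take(10)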

2 answers:

Answer 0 (score: 19)

You have most likely already perused the source code:

  class OrderedRDDFunctions {
    // <snip>
    def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.size): RDD[P] = {
      val part = new RangePartitioner(numPartitions, self, ascending)
      val shuffled = new ShuffledRDD[K, V, P](self, part)
      shuffled.mapPartitions(iter => {
        val buf = iter.toArray
        if (ascending) {
          buf.sortWith((x, y) => x._1 < y._1).iterator
        } else {
          buf.sortWith((x, y) => x._1 > y._1).iterator
        }
      }, preservesPartitioning = true)
    }
  }

And, as you say, the entire dataset must go through a shuffle stage, as seen in the snippet.

However, your concern about the subsequent call to take(K) is probably not quite accurate. This operation does NOT cycle through all N items:

  /**
   * Take the first num elements of the RDD. It works by first scanning one partition, and use the
   * results from that partition to estimate the number of additional partitions needed to satisfy
   * the limit.
   */
  def take(num: Int): Array[T] = {

So then, it would appear that:

  O(myRdd.take(K)) << O(myRdd.sortByKey()) ~= O(myRdd.sortByKey().take(k))
  (at least for small K) << O(myRdd.sortByKey().collect())
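Put concretely, the comparison above suggests calling take directly on the sorted RDD rather than collecting first; a small sketch, continuing with the myRdd assumed in the earlier snippet:

  // Cheaper: take(10) on the sorted RDD scans partitions incrementally,
  // so only a small prefix of the sorted data ever reaches the driver.
  val top10 = myRdd.sortByKey().take(10)

  // Expensive baseline: collect() materializes the whole sorted dataset
  // on the driver before anything is thrown away.
  val all = myRdd.sortByKey().collect()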

Answer 1 (score: 8)

Another option, at least as of PySpark 1.2.0, is to use takeOrdered.

In ascending order:

rdd.takeOrdered(10)

In descending order:

rdd.takeOrdered(10, lambda x: -x)

Top k values for (k, v) pairs:

rdd.takeOrdered(10, lambda (k, v): -v)
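For completeness, the Scala RDD API offers takeOrdered(num)(implicit ord) as well; the snippet below is a sketch not taken from the original answer (sc and the sample data are assumptions):

  val pairs = sc.parallelize(Seq("a" -> 3, "b" -> 1, "c" -> 2))

  // Ascending by the natural (key, value) tuple ordering:
  val smallest10 = pairs.takeOrdered(10)

  // Top 10 by value, descending: pass an explicit reversed Ordering.
  val byValueDesc = Ordering.by[(String, Int), Int](_._2).reverse
  val top10ByValue = pairs.takeOrdered(10)(byValueDesc)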