Spark Scala: a faster way to groupByKey and sort RDD values

Time: 2018-10-13 21:48:39

Tags: scala sorting apache-spark group-by rdd

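I have an RDD in which each row has the form (key, (int, double)).

I want to transform it into an RDD of the form (key, ((int, double), (int, double), ...)),

where the values in the new RDD are the top N value pairs for each key, sorted by the double.

So far I have come up with the solution below, but it is really slow and seems to run forever. It works fine on smaller RDDs, but the RDD is now too large.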

val top_rated = test_rated.partitionBy(new HashPartitioner(4)).sortBy(_._2._2).groupByKey()
            .mapValues(x => x.takeRight(n))

Is there a better, faster way to do this?

2 answers:

Answer 0: (score: 0)

The most efficient approach is probably aggregateByKey:

import org.apache.spark.rdd.RDD

type K = String
type V = (Int, Double)
val rdd: RDD[(K, V)] = ???

// TODO: implement a function that adds a value to a sorted array and keeps the top N elements. Returns the same array
def addToSortedArray(arr: Array[V], newValue: V): Array[V] = ???
// TODO: implement a function that merges 2 sorted arrays and keeps the top N elements. Returns the first array
def mergeSortedArrays(arr1: Array[V], arr2: Array[V]): Array[V] = ???

val result: RDD[(K, Array[(Int, Double)])] = rdd.aggregateByKey(zeroValue = new Array[V](0))(seqOp = addToSortedArray, combOp = mergeSortedArrays)
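The two helpers are left as TODOs in this answer. One possible way to fill them in is sketched below in plain Scala, with no Spark dependency so it can be tried in a REPL. The object name TopNSketch and the fixed cap n = 2 are assumptions made for illustration. Unlike the TODO comments, this version returns new arrays rather than mutating its inputs, which is simpler but allocates more.

```scala
object TopNSketch {
  type V = (Int, Double)

  // Assumed cap on the number of values kept per key (the question's N).
  val n = 2

  // Insert newValue into arr, which is sorted descending by the Double
  // and holds at most n elements. Returns a new array, not arr itself.
  def addToSortedArray(arr: Array[V], newValue: V): Array[V] = {
    // Index of the first element strictly smaller than newValue, or the end.
    val idx = arr.indexWhere(_._2 < newValue._2)
    val pos = if (idx < 0) arr.length else idx
    if (pos >= n) arr // n elements already rank at least as high
    else ((arr.take(pos) :+ newValue) ++ arr.drop(pos)).take(n)
  }

  // Merge two arrays that are each sorted descending, keeping the top n.
  def mergeSortedArrays(arr1: Array[V], arr2: Array[V]): Array[V] = {
    val out = new Array[V](math.min(n, arr1.length + arr2.length))
    var i = 0
    var j = 0
    var k = 0
    while (k < out.length) {
      // Take from arr1 when arr2 is exhausted or arr1's next value is at least as large.
      if (j >= arr2.length || (i < arr1.length && arr1(i)._2 >= arr2(j)._2)) {
        out(k) = arr1(i); i += 1
      } else {
        out(k) = arr2(j); j += 1
      }
      k += 1
    }
    out
  }
}
```

With Spark on the classpath, these could presumably be passed straight through, for example rdd.aggregateByKey(Array.empty[V])(addToSortedArray, mergeSortedArrays). Each partition then carries at most n values per key into the shuffle, instead of every value as groupByKey does.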

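Answer 1: (score: 0)

Since you are only interested in the top N values per key, I suggest avoiding a sort of the entire RDD. In addition, prefer the better-performing reduceByKey over groupByKey wherever possible. Below is an example that uses a topN method taken from a blog: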

def topN(n: Int, list: List[(Int, Double)]): List[(Int, Double)] = {
  // Rearrange l so that the element with the smallest Double sits at the head.
  def bigHead(l: List[(Int, Double)]): List[(Int, Double)] = l match {
    case Nil => l
    case _ => l.tail.foldLeft( List(l.head) )( (acc, x) =>
        if (x._2 <= acc.head._2) x :: acc else acc :+ x
      )
  }
  // Replace the current minimum (the head) with e when e is larger.
  def update(l: List[(Int, Double)], e: (Int, Double)): List[(Int, Double)] = {
    if (l.nonEmpty && e._2 > l.head._2) bigHead((e :: l.tail)) else l
  }
  // Seed with the first n elements, fold in the rest, then sort in descending order.
  list.drop(n).foldLeft( bigHead(list.take(n)) )( update ).sortWith(_._2 > _._2)
}

val rdd = sc.parallelize(Seq(
  ("a", (1, 10.0)), ("a", (4, 40.0)), ("a", (3, 30.0)), ("a", (5, 50.0)), ("a", (2, 20.0)),
  ("b", (3, 30.0)), ("b", (1, 10.0)), ("b", (4, 40.0)), ("b", (2, 20.0))
))

val n = 2

rdd.
  map{ case (k, v) => (k, List(v)) }.
  reduceByKey{ (acc, x) => topN(n, acc ++ x) }.
  collect
// res1:  Array[(String, List[(Int, Double)])] =
//   Array((a,List((5,50.0), (4,40.0))), (b,List((4,40.0), (3,30.0))))