How to group by multiple keys in a Spark RDD?

Asked: 2016-11-19 13:41:24

Tags: apache-spark group-by rdd

Imagine I have an RDD of triplets (Int, String, Int).

How can I efficiently group them by the first two elements and sort by the third? For example:

val RecordRDD: RDD[(Int, String, Int)] = sc.parallelize(Seq(
  (5, "x1", 100),
  (3, "x2", 200),
  (3, "x4", 300),
  (5, "x1", 150),
  (3, "x2", 160),
  (5, "x1", 400)
))

I am looking for an efficient way to do this.

Should I convert it to a DataFrame and use GroupBy(Col1, Col2) with SortBy(Col3)?

Would that be more efficient than Spark RDD's groupBy?

Can aggregateByKey aggregate by two keys at once?

*You can assume this RDD is very large! Thanks in advance.

1 answer:

Answer 0 (score: 2):

You didn't mention which Spark version you are running, but one way to do this with RDDs is:

val result = RecordRDD
  .map { case (x, y, z) => ((x, y), List(z)) }               // key by the first two elements
  .reduceByKey(_ ++ _)                                       // gather every third element per (x, y) key
  .map { case ((x, y), list) => (x, Map(y -> list.sorted)) } // sort each value list; re-key by x alone
  .reduceByKey(_ ++ _)                                       // merge the per-y maps under each x
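
For the sample RDD above, result has type RDD[(Int, Map[String, List[Int]])], and collecting it yields one nested map per first-element key (record order may vary across runs):

result.collect.foreach(println)
// (5,Map(x1 -> List(100, 150, 400)))
// (3,Map(x2 -> List(160, 200), x4 -> List(300)))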

I don't know if it is the most efficient way, but it is quite effective ;)
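
Since you asked about DataFrames: below is a minimal sketch of the equivalent grouping, assuming Spark 2.x with a SparkSession named spark (on 1.6 you would use sqlContext and its implicits instead); the column names c1/c2/c3 are illustrative, not part of your data.

import org.apache.spark.sql.functions.{col, collect_list, sort_array}
import spark.implicits._  // assumed SparkSession named `spark`; enables .toDF on an RDD of tuples

// Give the three tuple fields illustrative column names.
val df = RecordRDD.toDF("c1", "c2", "c3")

val grouped = df
  .groupBy("c1", "c2")                                // group by the first two columns
  .agg(sort_array(collect_list(col("c3"))).as("c3s")) // collect each group's third column and sort it

As for aggregateByKey: it operates on a single key, but that key can itself be a tuple, which is exactly what the ((x, y), ...) pairing in the RDD version above does. Whether the DataFrame route beats the RDD one depends on your Spark version and data; benchmarking on a sample of your data is the only reliable way to tell.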