Alternative to groupBy on several columns of a Spark DataFrame

Time: 2018-04-05 03:18:40

Tags: scala apache-spark group-by

I have a Spark DataFrame with the following columns:

df
--------------------------
A     B     C     D     E     F     amt
"A1"  "B1"  "C1"  "D1"  "E1"  "F1"  1
"A2"  "B2"  "C2"  "D2"  "E2"  "F2"  2

I want to perform a groupBy on the following column combinations:

(A, B, sum(amt))
(A, C, sum(amt))
(A, D, sum(amt))
(A, E, sum(amt))
(A, F, sum(amt))

so that the resulting DataFrame looks like:

df_grouped
----------------------
A     field    value   amt
"A1"    "B"     "B1"    1
"A2"    "B"     "B2"    2
"A1"    "C"     "C1"    1
"A2"    "C"     "C2"    2
"A1"    "D"     "D1"    1
"A2"    "D"     "D2"    2

My attempt at this is as follows:

// assumes import org.apache.spark.sql.functions._ and spark.implicits._ are in scope
val cols = Vector("B", "C", "D", "E", "F")
// code for creating an empty data frame with the columns A, field, value and amt
for (c <- cols) {
  empty_df = empty_df.union(
    df.groupBy($"A", col(c))
      .agg(sum("amt").as("amt"))
      .withColumn("field", lit(c))
      .withColumnRenamed(c, "value")
      .select("A", "field", "value", "amt") // align column order for the union
  )
}

I feel that a "for" or "foreach" loop may be clunky for a distributed environment like Spark. Is there a map-style alternative to what I am doing? It seems to me that aggregateByKey and collect_list might work, but I cannot picture a complete solution. Please advise.

1 Answer:

Answer 0 (score: 2)

foldLeft is a very powerful function in Scala, if you know how to use it. I suggest you use foldLeft (I have commented the code for clarity):
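To see the idea first, foldLeft threads an accumulator through a collection, combining it with each element in turn; the accumulator can be any type, including a pair, which is exactly how the solution below carries two DataFrames along (a toy sketch to illustrate, not part of the solution itself):

// starts from the initial accumulator 0 and folds each element in, left to right
val total = List(1, 2, 3).foldLeft(0)((acc, x) => acc + x) // 6

// the accumulator can also be a pair that is updated at each step
val (count, totalSum) = List(1, 2, 3).foldLeft((0, 0)) {
  case ((c, s), x) => (c + 1, s + x)
} // (3, 6)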

// selecting the columns other than A and amt
val columnsForAggregation = df.columns.tail.toSet - "amt"

// creating an empty dataframe (the format of the final output)
// assuming a SparkSession named `spark` is in scope; its implicits are needed for toDF
import spark.implicits._
val finalDF = Seq(("empty", "empty", "empty", 0.0)).toDF("A", "field", "value", "amt")

// using foldLeft to perform each aggregation and merge the results
import org.apache.spark.sql.functions._
val (originaldf, transformeddf) = columnsForAggregation.foldLeft((df, finalDF)) { (tempdf, column) =>
  // aggregating on A and one other column, then selecting the columns required in the output
  val aggregatedf = tempdf._1.groupBy("A", column).agg(sum("amt").as("amt"))
    .select(col("A"), lit(column).as("field"), col(column).as("value"), col("amt"))
  // union the aggregated result and pass both dataframes on to the next iteration
  (df, tempdf._2.union(aggregatedf))
}

//finally removing the dummy row created
transformeddf.filter(col("A") =!= "empty")
  .show(false)

You should have the desired dataframe:

+---+-----+-----+---+
|A  |field|value|amt|
+---+-----+-----+---+
|A1 |E    |E1   |1.0|
|A2 |E    |E2   |2.0|
|A1 |F    |F1   |1.0|
|A2 |F    |F2   |2.0|
|A2 |B    |B2   |2.0|
|A1 |B    |B1   |1.0|
|A2 |C    |C2   |2.0|
|A1 |C    |C1   |1.0|
|A1 |D    |D1   |1.0|
|A2 |D    |D2   |2.0|
+---+-----+-----+---+

I hope the answer is helpful.

A more concise form of the above foldLeft is:

import org.apache.spark.sql.functions._
val (originaldf, transformeddf) = columnsForAggregation.foldLeft((df, finalDF)) { (tempdf, column) =>
  (df, tempdf._2.union(
    tempdf._1.groupBy("A", column).agg(sum("amt").as("amt"))
      .select(col("A"), lit(column).as("field"), col(column).as("value"), col("amt"))))
}
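
As for the map-style alternative you asked about: since the per-column aggregations are independent, you could also build one DataFrame per column with map and merge them pairwise with reduce, which avoids the dummy seed row and the final filter entirely (a sketch under the same assumptions as above):

import org.apache.spark.sql.functions._

// one aggregated dataframe per column, merged pairwise with union;
// no seed dataframe is needed, so no dummy row has to be filtered out
val df_grouped = columnsForAggregation.toSeq
  .map { column =>
    df.groupBy("A", column).agg(sum("amt").as("amt"))
      .select(col("A"), lit(column).as("field"), col(column).as("value"), col("amt"))
  }
  .reduce(_ union _)

df_grouped.show(false)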