I have a Spark DataFrame with the following columns:
df
--------------------------
A B C D E F amt
"A1" "B1" "C1" "D1" "E1" "F1" 1
"A2" "B2" "C2" "D2" "E2" "F2" 2
I want to perform a groupBy over these column combinations:
(A, B, sum(amt))
(A, C, sum(amt))
(A, D, sum(amt))
(A, E, sum(amt))
(A, F, sum(amt))
so that the resulting DataFrame looks like:
df_grouped
----------------------
A field value amt
"A1" "B" "B1" 1
"A2" "B" "B2" 2
"A1" "C" "C1" 1
"A2" "C" "C2" 2
"A1" "D" "D1" 1
"A2" "D" "D2" 2
My attempt at this is as follows:
val cols = Vector("B", "C", "D", "E", "F")
// code for creating an empty data frame with structs for the cols A, field, value and amt
for (c <- cols) {
  empty_df = empty_df.union(
    df.groupBy($"A", col(c))
      .agg(sum("amt").as("amt"))
      .withColumn("field", lit(c))
      .withColumnRenamed(c, "value")
      // union matches columns by position, so align the order with empty_df
      .select("A", "field", "value", "amt"))
}
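I feel that a "for" or "foreach" loop may be clumsy in a distributed environment like Spark. Is there a map-style alternative to what I am doing? It seems to me that aggregateByKey and collect_list might work; however, I can't picture a complete solution. Please advise.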
Answer 0 (score: 2)
foldLeft is a very powerful function in Scala once you know how to use it. I suggest using foldLeft here (I have commented the code for clarity and explanation):
import org.apache.spark.sql.functions._
// toDF below also needs `import spark.implicits._` (already in scope in spark-shell)

// select the columns to aggregate: everything except A (the first column) and amt
val columnsForAggregation = df.columns.tail.toSet - "amt"
// create a one-row placeholder DataFrame that fixes the format of the final output
val finalDF = Seq(("empty", "empty", "empty", 0.0)).toDF("A", "field", "value", "amt")
// use foldLeft to aggregate per column and merge each aggregated result
val (originaldf, transformeddf) = columnsForAggregation.foldLeft((df, finalDF)) { (tempdf, column) =>
  // aggregate on A and one of the columns, then select the columns required in the output
  val aggregatedf = tempdf._1.groupBy("A", column).agg(sum("amt").as("amt"))
    .select(col("A"), lit(column).as("field"), col(column).as("value"), col("amt"))
  // union the aggregated result and pass both DataFrames on to the next iteration
  (df, tempdf._2.union(aggregatedf))
}
// finally, remove the placeholder row
transformeddf.filter(col("A") =!= "empty")
  .show(false)
You should get the desired DataFrame:
+---+-----+-----+---+
|A |field|value|amt|
+---+-----+-----+---+
|A1 |E |E1 |1.0|
|A2 |E |E2 |2.0|
|A1 |F |F1 |1.0|
|A2 |F |F2 |2.0|
|A2 |B |B2 |2.0|
|A1 |B |B1 |1.0|
|A2 |C |C2 |2.0|
|A1 |C |C1 |1.0|
|A1 |D |D1 |1.0|
|A2 |D |D2 |2.0|
+---+-----+-----+---+
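For readers less familiar with foldLeft, the same accumulate-and-append pattern can be sketched on plain Scala collections, without Spark. This is only an illustration: the `Row` case class, the `rows` sample and the `unpivotSums` helper are hypothetical names that model a two-column slice of the question's data.

```scala
// Hypothetical in-memory model of the question's sample data (not Spark).
final case class Row(a: String, fields: Map[String, String], amt: Long)

val rows = Seq(
  Row("A1", Map("B" -> "B1", "C" -> "C1"), 1L),
  Row("A2", Map("B" -> "B2", "C" -> "C2"), 2L)
)

// For each field name: group by (A, value of that field), sum amt, and
// append the result to the accumulator, which starts as an empty Vector.
def unpivotSums(data: Seq[Row], fieldNames: Seq[String]): Vector[(String, String, String, Long)] =
  fieldNames.foldLeft(Vector.empty[(String, String, String, Long)]) { (acc, field) =>
    val aggregated = data
      .groupBy(r => (r.a, r.fields(field)))
      .map { case ((a, value), group) => (a, field, value, group.map(_.amt).sum) }
      .toVector
      .sorted // deterministic order within each field
    acc ++ aggregated
  }

val result = unpivotSums(rows, Seq("B", "C"))
```

The accumulator plays the role of the growing unioned DataFrame, and each step of the fold contributes the rows for one field.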
I hope the answer is helpful.
A concise form of the foldLeft above is:
import org.apache.spark.sql.functions._
val (originaldf, transformeddf) = columnsForAggregation.foldLeft((df, finalDF)) { (tempdf, column) =>
  (df, tempdf._2.union(tempdf._1.groupBy("A", column).agg(sum("amt").as("amt"))
    .select(col("A"), lit(column).as("field"), col(column).as("value"), col("amt"))))
}
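As a possible variation, the placeholder row can be avoided by building one aggregated DataFrame per column with map and combining them with reduce. The sketch below is an alternative, not the method from the answer; it assumes a local SparkSession (the `master` and `appName` values are illustrative) and recreates the question's sample data so that it is self-contained. The iteration over column names runs on the driver and only assembles the query plan; Spark executes the aggregations when an action such as show is called.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, sum}

// Assumption: a local session, created here only to keep the sketch self-contained.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("groupby-union-sketch")
  .getOrCreate()
import spark.implicits._

// Sample data matching the question.
val df = Seq(
  ("A1", "B1", "C1", "D1", "E1", "F1", 1),
  ("A2", "B2", "C2", "D2", "E2", "F2", 2)
).toDF("A", "B", "C", "D", "E", "F", "amt")

// Every column except the grouping key A and the measure amt, in original order.
val columnsForAggregation = df.columns.filterNot(Set("A", "amt")).toSeq

// One aggregated DataFrame per column, each with the schema (A, field, value, amt).
val perColumn: Seq[DataFrame] = columnsForAggregation.map { column =>
  df.groupBy("A", column)
    .agg(sum("amt").as("amt"))
    .select(col("A"), lit(column).as("field"), col(column).as("value"), col("amt"))
}

// Union the pieces pairwise; no placeholder row or final filter is needed.
val transformeddf = perColumn.reduce(_ union _)
transformeddf.show(false)
```

Because no Double-typed placeholder takes part in the union, amt keeps the type produced by sum (a long for this integer sample) rather than being widened to a double.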