How can I parameterize the following Spark code? The groupBy and pivot values are constant; what I need to parameterize are the aggregations.
var final_df_transpose = df_transpose.groupBy("_id").pivot("Type").agg(first("Value").alias("Value"), first("OType").alias("OType"), first("DateTime").alias("DateTime"))
In the call above, the aliases cannot be passed dynamically. What I have is a map of aggregation expressions:
agg_Map: scala.collection.mutable.Map[String,String] = Map(OType -> first, Type -> first, Value -> first, DateTime -> first)
var agg_Map = collection.mutable.Map[String, String]()
// fin_agg_col is the list of column names to aggregate
for (aggDataCol <- fin_agg_col) {
  agg_Map += (aggDataCol -> "first")
}
df_transpose.groupBy("_id").pivot("Type").agg(agg_Map.toMap).show
Answer (score: 0):
I can think of two ways to do this, but I'm not happy with either of them.
First, define the aggregations as a list of Columns. The annoying part here is that, to satisfy the method signature, you need to add a dummy aggregation and then drop the resulting columns after the pivot:
scala> val in = spark.read.option("header", true).csv("""_id,Type,Value,OType,DateTime
| 0,a,b,c,d
| 1,aaa,bbb,ccc,ddd""".split("\n").toSeq.toDS)
in: org.apache.spark.sql.DataFrame = [_id: string, Type: string ... 3 more fields]
scala> in.show
+---+----+-----+-----+--------+
|_id|Type|Value|OType|DateTime|
+---+----+-----+-----+--------+
| 0| a| b| c| d|
| 1| aaa| bbb| ccc| ddd|
+---+----+-----+-----+--------+
scala> val aggColumns = Seq("Value", "OType", "DateTime").map{c => first(c).alias(c)}
aggColumns: Seq[org.apache.spark.sql.Column] = List(first(Value, false) AS `Value`, first(OType, false) AS `OType`, first(DateTime, false) AS `DateTime`)
scala> val df_intermediate = in.groupBy("_id").pivot("Type").agg(lit("dummy"), aggColumns : _*)
df_intermediate: org.apache.spark.sql.DataFrame = [_id: string, a_dummy: string ... 7 more fields]
scala> df_intermediate.show
+---+-------+-------+-------+----------+---------+---------+---------+------------+
|_id|a_dummy|a_Value|a_OType|a_DateTime|aaa_dummy|aaa_Value|aaa_OType|aaa_DateTime|
+---+-------+-------+-------+----------+---------+---------+---------+------------+
| 0| dummy| b| c| d| dummy| null| null| null|
| 1| dummy| null| null| null| dummy| bbb| ccc| ddd|
+---+-------+-------+-------+----------+---------+---------+---------+------------+
scala> val df_final = df_intermediate.drop(df_intermediate.schema.collect{case c if c.name.endsWith("_dummy") => c.name} : _*)
df_final: org.apache.spark.sql.DataFrame = [_id: string, a_Value: string ... 5 more fields]
scala> df_final.show
+---+-------+-------+----------+---------+---------+------------+
|_id|a_Value|a_OType|a_DateTime|aaa_Value|aaa_OType|aaa_DateTime|
+---+-------+-------+----------+---------+---------+------------+
| 0| b| c| d| null| null| null|
| 1| null| null| null| bbb| ccc| ddd|
+---+-------+-------+----------+---------+---------+------------+
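To package the first approach behind a single call, here is a minimal sketch (not part of the original answer) of a helper parameterized by the grouping column, the pivot column, and the columns to aggregate; the name pivotFirst and its signature are illustrative choices:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{first, lit}

def pivotFirst(df: DataFrame, groupCol: String, pivotCol: String, aggCols: Seq[String]): DataFrame = {
  // One aliased first() aggregation per requested column.
  val aggColumns = aggCols.map(c => first(c).alias(c))
  // agg() takes a head Column plus varargs, so pass a dummy literal first...
  val withDummy = df.groupBy(groupCol).pivot(pivotCol).agg(lit("dummy"), aggColumns: _*)
  // ...then drop every generated <pivotValue>_dummy column again.
  withDummy.drop(withDummy.columns.filter(_.endsWith("_dummy")): _*)
}

With the sample data above, pivotFirst(in, "_id", "Type", Seq("Value", "OType", "DateTime")).show should produce the same result as df_final.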
Second, keep using the Map of agg expressions, then use a regex to find the renamed columns and rename them back:
scala> val aggExprs = Map(("OType" -> "first"), ("Value" -> "first"), "DateTime" -> "first")
aggExprs: scala.collection.immutable.Map[String,String] = Map(OType -> first, Value -> first, DateTime -> first)
scala> val df_intermediate = in.groupBy("_id").pivot("Type").agg(aggExprs)
df_intermediate: org.apache.spark.sql.DataFrame = [_id: string, a_first(OType, false): string ... 5 more fields]
scala> df_intermediate.show
+---+---------------------+---------------------+------------------------+-----------------------+-----------------------+--------------------------+
|_id|a_first(OType, false)|a_first(Value, false)|a_first(DateTime, false)|aaa_first(OType, false)|aaa_first(Value, false)|aaa_first(DateTime, false)|
+---+---------------------+---------------------+------------------------+-----------------------+-----------------------+--------------------------+
| 0| c| b| d| null| null| null|
| 1| null| null| null| ccc| bbb| ddd|
+---+---------------------+---------------------+------------------------+-----------------------+-----------------------+--------------------------+
scala> val regex = "^(.*)_first\\((.*), false\\)$".r
regex: scala.util.matching.Regex = ^(.*)_first\((.*), false\)$
scala> val replacements = df_intermediate.schema.collect{ case c if regex.findFirstMatchIn(c.name).isDefined =>
| val regex(pivotVal, colName) = c.name
| c.name -> s"${pivotVal}_$colName"
| }.toMap
replacements: scala.collection.immutable.Map[String,String] = Map(a_first(DateTime, false) -> a_DateTime, aaa_first(DateTime, false) -> aaa_DateTime, aaa_first(OType, false) -> aaa_OType, a_first(Value, false) -> a_Value, a_first(OType, false) -> a_OType, aaa_first(Value, false) -> aaa_Value)
scala> val df_final = replacements.foldLeft(df_intermediate){(df, c) => df.withColumnRenamed(c._1, c._2)}
df_final: org.apache.spark.sql.DataFrame = [_id: string, a_OType: string ... 5 more fields]
scala> df_final.show
+---+-------+-------+----------+---------+---------+------------+
|_id|a_OType|a_Value|a_DateTime|aaa_OType|aaa_Value|aaa_DateTime|
+---+-------+-------+----------+---------+---------+------------+
| 0| c| b| d| null| null| null|
| 1| null| null| null| ccc| bbb| ddd|
+---+-------+-------+----------+---------+---------+------------+
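Similarly, a sketch (again my own wrapper, assuming the "<pivotValue>_first(<col>, false)" column-name pattern observed above) that packages the Map-based approach and folds the regex rename in:

import org.apache.spark.sql.DataFrame

def pivotFirstViaMap(df: DataFrame, groupCol: String, pivotCol: String, aggCols: Seq[String]): DataFrame = {
  // Aggregate every requested column with "first".
  val pivoted = df.groupBy(groupCol).pivot(pivotCol).agg(aggCols.map(_ -> "first").toMap)
  // Rename "<pivotValue>_first(<col>, false)" back to "<pivotValue>_<col>".
  val regex = "^(.*)_first\\((.*), false\\)$".r
  pivoted.columns.foldLeft(pivoted) { (acc, name) =>
    regex.findFirstMatchIn(name) match {
      case Some(m) => acc.withColumnRenamed(name, s"${m.group(1)}_${m.group(2)}")
      case None    => acc
    }
  }
}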
Take your pick, but both involve some steps you would rather not need.