+---+---+----+
| id|key|name|
+---+---+----+
| 10|  1|   a|
| 11|  1|   b|
| 12|  1|   c|
| 20|  2|   d|
| 21|  2|   e|
| 30|  3|   f|
| 31|  3|   g|
| 32|  3|   h|
| 33|  3|   i|
| 40|  4|   j|
| 41|  4|   k|
| 42|  4|   l|
| 43|  4|   m|
| 44|  4|   n|
+---+---+----+
import spark.implicits._  // assumes a spark-shell or an existing SparkSession named `spark`

val df = Seq(
  (10, 1, "a"), (11, 1, "b"), (12, 1, "c"),
  (20, 2, "d"), (21, 2, "e"),
  (30, 3, "f"), (31, 3, "g"), (32, 3, "h"), (33, 3, "i"),
  (40, 4, "j"), (41, 4, "k"), (42, 4, "l"), (43, 4, "m"), (44, 4, "n")
).toDF("id", "key", "name")
I am trying to produce the following output: reshape the row values into columns while aggregating by key, with name and id each capped at 4 columns:
+---+------+----+------+----+------+----+------+----+
|key|name_1|id_1|name_2|id_2|name_3|id_3|name_4|id_4|
+---+------+----+------+----+------+----+------+----+
|  1|     a|  10|     b|  11|     c|  12|  null|null|
|  2|     d|  20|     e|  21|  null|null|  null|null|
|  3|     f|  30|     g|  31|     h|  32|     i|  33|
|  4|     j|  40|     k|  41|     l|  42|     m|  43|
+---+------+----+------+----+------+----+------+----+
I am a beginner with Scala and Spark. Any help, suggestions, or questions are welcome.
Answer 0 (score: 1)
Aggregate the rows by key (groupBy) and collect the ids and names into lists with collect_list. Once the values are in lists, you can reference individual elements by index in a select statement; indexing past the end of a list returns null, which gives exactly the padding you want for keys with fewer than four rows.
import org.apache.spark.sql.functions.collect_list

df.groupBy(df("key"))
  .agg(
    collect_list(df("name")).alias("name_list"),
    collect_list(df("id")).alias("id_list"))
  .selectExpr(
    "key",
    "name_list[0] as name_1",
    "id_list[0] as id_1",
    "name_list[1] as name_2",
    "id_list[1] as id_2",
    "name_list[2] as name_3",
    "id_list[2] as id_3",
    "name_list[3] as name_4",
    "id_list[3] as id_4"
  ).show
+---+------+----+------+----+------+----+------+----+
|key|name_1|id_1|name_2|id_2|name_3|id_3|name_4|id_4|
+---+------+----+------+----+------+----+------+----+
|  1|     a|  10|     b|  11|     c|  12|  null|null|
|  3|     f|  30|     g|  31|     h|  32|     i|  33|
|  4|     j|  40|     k|  41|     l|  42|     m|  43|
|  2|     d|  20|     e|  21|  null|null|  null|null|
+---+------+----+------+----+------+----+------+----+
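One caveat: collect_list makes no ordering guarantee, so after a wide aggregation name_list and id_list could each come back in a different order and the name/id pairs would no longer line up. A safer variant, sketched below under the assumption that rows should be ordered by id within each key, collects each (id, name) pair as a single struct and sorts the resulting array (structs compare field by field, so this orders by id first), which keeps the two columns aligned:

import org.apache.spark.sql.functions.{collect_list, sort_array, struct}

// Collect (id, name) pairs together so they cannot be reordered independently,
// then sort each group's array of structs by id.
df.groupBy(df("key"))
  .agg(sort_array(collect_list(struct(df("id"), df("name")))).alias("pairs"))
  .selectExpr(
    "key",
    "pairs[0].name as name_1",
    "pairs[0].id as id_1",
    "pairs[1].name as name_2",
    "pairs[1].id as id_2",
    "pairs[2].name as name_3",
    "pairs[2].id as id_3",
    "pairs[3].name as name_4",
    "pairs[3].id as id_4"
  ).show

Sorting inside the aggregation also makes it deterministic which four rows survive the 4-column cap when a key has more than four rows.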