I have a dataframe that currently looks like this:
|col_id|r_id_1|r_id_2|r_id_3|
|------|------|------|------|
| 1    | a1   | b1   | c1   |
| 1    | a2   | b2   | c2   |
| 2    | a3   | b3   | c3   |
| 2    | a4   | b4   | c4   |
I want to transform it into this form:

|col_id|r_id_1|r_id_2|r_id_3|r_id_1|r_id_2|r_id_3|
|------|------|------|------|------|------|------|
| 1    | a1   | b1   | c1   | a2   | b2   | c2   |
| 2    | a3   | b3   | c3   | a4   | b4   | c4   |
So there are two rows with col_id 1; after grouping by col_id, the existing rows should be used to generate new columns. Note: every col_id has the same number of rows.
Answer (score: 1):
This should do it:
```scala
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._

// Assumes spark.implicits._ is in scope (e.g. in spark-shell),
// which provides toDF and the encoder needed by flatMap below.
val df = Seq(
  (1, "a1", "b1", "c1"),
  (1, "a2", "b2", "c2"),
  (2, "a3", "b3", "c3"),
  (2, "a4", "b4", "c4")
).toDF("col_id", "r_id_1", "r_id_2", "r_id_3")

// All value columns (everything except col_id)
val cols = df.columns.tail

df
  // Number the rows within each col_id group; orderBy("r_id_1")
  // decides which row becomes the first, second, ... set of columns
  .withColumn("rn",
    row_number().over(Window.partitionBy("col_id").orderBy("r_id_1")))
  // Explode each row into (col_id, "<column>_<row number>", value) triples
  .flatMap { row => row.getValuesMap[String](cols).map {
    case (c, t) => (row.getAs[Int]("col_id"), s"${c}_${row.getAs[Int]("rn")}", t) }}
  // Pivot the generated "<column>_<row number>" names back into wide form
  .groupBy("_1")
  .pivot("_2")
  .agg(first("_3"))
  .show
```
```
+---+--------+--------+--------+--------+--------+--------+
| _1|r_id_1_1|r_id_1_2|r_id_2_1|r_id_2_2|r_id_3_1|r_id_3_2|
+---+--------+--------+--------+--------+--------+--------+
|  1|      a1|      a2|      b1|      b2|      c1|      c2|
|  2|      a3|      a4|      b3|      b4|      c3|      c4|
+---+--------+--------+--------+--------+--------+--------+
```
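
For reference, a pure-DataFrame variant is possible without the flatMap round-trip: pivot on the per-group row number directly and aggregate each value column. This is a minimal sketch assuming the same `df`, imports, and Spark session as above; the resulting column names (e.g. `1_r_id_1`) follow Spark's pivot naming and differ from the output shown, so they would need renaming to match.

```scala
// Sketch of an alternative, assuming the same `df` and imports as above.
// Pivot on the row number within each col_id group; aliasing each
// aggregate makes the output columns "1_r_id_1", "2_r_id_1", etc.
val wide = df
  .withColumn("rn",
    row_number().over(Window.partitionBy("col_id").orderBy("r_id_1")))
  .groupBy("col_id")
  .pivot("rn")
  .agg(first("r_id_1").as("r_id_1"),
       first("r_id_2").as("r_id_2"),
       first("r_id_3").as("r_id_3"))
wide.show
```

The columns could then be renamed to the `r_id_x_n` convention with a `withColumnRenamed` pass if the exact naming matters.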