Spark - append multiple rows as columns for a common column ID

Date: 2017-07-30 06:19:03

Tags: scala apache-spark apache-spark-sql

I have a DataFrame that currently looks like this:

|col_id|r_id_1|r_id_2|r_id_3|
|   1  |  a1  |  b1  |  c1  |
|   1  |  a2  |  b2  |  c2  |
|   2  |  a3  |  b3  |  c3  |
|   2  |  a4  |  b4  |  c4  |

I want to transform it into the following form:

|col_id|r_id_1|r_id_2|r_id_3|r_id_1|r_id_2|r_id_3|
|   1  |  a1  |  b1  |  c1  |  a2  |  b2  |  c2  |
|   2  |  a3  |  b3  |  c3  |  a4  |  b4  |  c4  |

So there are two rows whose col_id is 1; once grouped by col_id, the values from those existing rows should become new columns. Note: every col_id has the same number of rows.

1 Answer:

Answer 0 (score: 1)

This should do it:

import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._

// assumes a SparkSession named `spark` is in scope (as in spark-shell)
import spark.implicits._

val df = Seq(
  (1, "a1", "b1", "c1"),
  (1, "a2", "b2", "c2"),
  (2, "a3", "b3", "c3"),
  (2, "a4", "b4", "c4")
).toDF("col_id", "r_id_1", "r_id_2", "r_id_3")

// the value columns, i.e. everything except col_id
val cols = df.columns.tail

df
  // number the rows within each col_id group
  .withColumn("rn",
    row_number().over(Window.partitionBy("col_id").orderBy("r_id_1")))
  // melt to long format: (col_id, "<column>_<row number>", value)
  .flatMap { row => row.getValuesMap[String](cols).map {
    case (c, t) => (row.getAs[Int]("col_id"), s"${c}_${row.getAs[Int]("rn")}", t) } }
  // pivot the long rows back out into one column per (column, row number) pair
  .groupBy("_1")
  .pivot("_2")
  .agg(first("_3"))
  .show

+---+--------+--------+--------+--------+--------+--------+
| _1|r_id_1_1|r_id_1_2|r_id_2_1|r_id_2_2|r_id_3_1|r_id_3_2|
+---+--------+--------+--------+--------+--------+--------+
|  1|      a1|      a2|      b1|      b2|      c1|      c2|
|  2|      a3|      a4|      b3|      b4|      c3|      c4|
+---+--------+--------+--------+--------+--------+--------+
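
How this works: row_number tags each row with its position inside its col_id group, the flatMap melts the frame into (col_id, "<column>_<position>", value) triples, and the pivot widens those triples back out, one column per pair. As a sketch of a shorter variant (not from the original answer; the aliases passed to agg are my own), you can skip the typed flatMap and pivot on the row number directly. Spark names the resulting columns "<rn>_<alias>", e.g. 1_r_id_1, though the exact naming can vary by version, so a rename pass may still be needed:

df
  .withColumn("rn",
    row_number().over(Window.partitionBy("col_id").orderBy("r_id_1")))
  .groupBy("col_id")
  .pivot("rn")
  // one output column per (row number, alias) pair, e.g. "1_r_id_1"
  .agg(
    first("r_id_1").as("r_id_1"),
    first("r_id_2").as("r_id_2"),
    first("r_id_3").as("r_id_3"))
  .show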