Pivot DataFrame with multiple columns - Spark/Scala

Date: 2021-06-15 05:19:05

Tags: scala apache-spark pivot

I have a DataFrame that looks like this:

+----------+----+----+----+
|      date|col1|col2|col3|
+----------+----+----+----+
|2021-05-01|  20|  30|  40|
|2021-05-02| 200| 300|  10|
+----------+----+----+----+
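For reference, here is a minimal sketch that reproduces the input above (assuming the date is a string and the value columns are integers):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

// Sample data matching the table above; the column types are assumptions
val dUnion = Seq(
  ("2021-05-01", 20, 30, 40),
  ("2021-05-02", 200, 300, 10)
).toDF("date", "col1", "col2", "col3")
```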

I would like to transpose this DataFrame into:

+-----+----------+----------+
|col  |2021-05-01|2021-05-02|
+-----+----------+----------+
|col1 |        20|       200|
|col2 |        30|       300|
|col3 |        40|        10|
+-----+----------+----------+

Other stackoverflow posts such as this and this helped me to some extent, but I have not found a solution yet.

My attempts (all of which failed) were:

scala> dUnion.groupBy("date").pivot("date").agg(first("col1")).show()
+----------+----------+----------+
|      date|2021-05-01|2021-05-02|
+----------+----------+----------+
|2021-05-02|      null|       200|
|2021-05-01|        20|      null|
+----------+----------+----------+

scala> dUnion.groupBy("date", "col1", "col2", "col3").pivot("date").agg(first("col1")).show()
+----------+----+----+----+----------+----------+
|      date|col1|col2|col3|2021-05-01|2021-05-02|
+----------+----+----+----+----------+----------+
|2021-05-02| 200| 300|  10|      null|       200|
|2021-05-01|  20|  30|  40|        20|      null|
+----------+----+----+----+----------+----------+

But the closest I could get was:

scala> dUnion.groupBy().pivot("date").agg(first("col1")).show()
+----------+----------+
|2021-05-01|2021-05-02|
+----------+----------+
|        20|       200|
+----------+----------+

1 Answer:

Answer 0: (score: 1)

It is possible, but I expect it to be somewhat slow.

// Requires `import spark.implicits._` in scope for the Dataset encoder used by flatMap
val schema = df.schema
val longForm = df.flatMap(row => {
    val date = row.get(0).toString                 // first column holds the date
    (1 until row.size).map(i => {
        // (date, column name, value); use get(...).toString so non-string
        // value columns (e.g. integers) do not throw a ClassCastException
        (date, schema(i).name, row.get(i).toString)
    })
})

longForm.groupBy('_2).pivot('_1).agg(first('_3))
    .withColumnRenamed("_2", "col").show(10, false)


+----+----------+----------+
|col |2021-05-01|2021-05-02|
+----+----------+----------+
|col3|40        |10        |
|col1|20        |200       |
|col2|30        |300       |
+----+----------+----------+
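As an alternative sketch, Spark SQL's built-in `stack` expression can melt the wide DataFrame into long form without a round trip through `flatMap`, after which the same `pivot` applies (this assumes the `dUnion` DataFrame from the question, with a string date and integer value columns):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.first

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

val dUnion = Seq(
  ("2021-05-01", 20, 30, 40),
  ("2021-05-02", 200, 300, 10)
).toDF("date", "col1", "col2", "col3")

// stack(n, name1, value1, name2, value2, ...) emits one (col, value)
// row per listed pair, turning the wide table into long form
val longForm = dUnion.selectExpr(
  "date",
  "stack(3, 'col1', col1, 'col2', col2, 'col3', col3) as (col, value)"
)

longForm.groupBy("col").pivot("date").agg(first("value")).orderBy("col").show()
```

Because `stack` keeps the integer type of the value columns, the pivoted output columns stay numeric instead of being cast to strings.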