I'm using Scala and Spark to unpivot a table that looks like this:
+---+----------+--------+-------+------+-----+
| ID| Date | Type1 | Type2 | 0:30 | 1:00|
+---+----------+--------+-------+------+-----+
| G| 12/3/2018| Import|Voltage| 3.5 | 6.8 |
| H| 13/3/2018| Import|Voltage| 7.5 | 9.8 |
| H| 13/3/2018| Export| Watt| 4.5 | 8.9 |
| H| 13/3/2018| Export|Voltage| 5.6 | 9.1 |
+---+----------+--------+-------+------+-----+
I want to transpose it as follows:
| ID|Date | Time|Import-Voltage |Export-Voltage|Import-Watt|Export-Watt|
| G|12/3/2018|0:30 |3.5 |0 |0 |0 |
| G|12/3/2018|1:00 |6.8 |0 |0 |0 |
| H|13/3/2018|0:30 |7.5 |5.6 |0 |4.5 |
| H|13/3/2018|1:00 |9.8 |9.1 |0 |8.9 |
The Time and Date columns should also be merged, like 12/3/2018 0:30.
Answer (score: 2)
It's not a straightforward task, but one approach is to:

1. group the time columns and their corresponding values into a "map" (here, an array of structs) of time-value pairs
2. explode the column of time-value pairs into one row per pair
3. perform a groupBy-pivot-agg transformation, with the time as part of the groupBy keys and the combined types as the pivot column, aggregating the value that corresponds to each time

Sample code below:
import org.apache.spark.sql.functions._
import spark.implicits._  // assumes an existing SparkSession named `spark` (e.g. in spark-shell)

val df = Seq(
  ("G", "12/3/2018", "Import", "Voltage", 3.5, 6.8),
  ("H", "13/3/2018", "Import", "Voltage", 7.5, 9.8),
  ("H", "13/3/2018", "Export", "Watt",    4.5, 8.9),
  ("H", "13/3/2018", "Export", "Voltage", 5.6, 9.1)
).toDF("ID", "Date", "Type1", "Type2", "0:30", "1:00")

df.
  // 1. collect each time column and its value into an array of (time, value) structs
  withColumn("TimeValMap", array(
    struct(lit("0:30").as("_1"), $"0:30".as("_2")),
    struct(lit("1:00").as("_1"), $"1:00".as("_2"))
  )).
  // 2. explode the array into one row per (time, value) pair
  withColumn("TimeVal", explode($"TimeValMap")).
  withColumn("Time", $"TimeVal._1").
  // combine Type1 and Type2 into the pivot key, e.g. "Import-Voltage"
  withColumn("Types", concat_ws("-", array($"Type1", $"Type2"))).
  // 3. pivot on Types, taking the value that belongs to each time
  groupBy("ID", "Date", "Time").pivot("Types").agg(first($"TimeVal._2")).
  orderBy("ID", "Date", "Time").
  na.fill(0.0).
  show
// +---+---------+----+--------------+-----------+--------------+
// | ID| Date|Time|Export-Voltage|Export-Watt|Import-Voltage|
// +---+---------+----+--------------+-----------+--------------+
// | G|12/3/2018|0:30| 0.0| 0.0| 3.5|
// | G|12/3/2018|1:00| 0.0| 0.0| 6.8|
// | H|13/3/2018|0:30| 5.6| 4.5| 7.5|
// | H|13/3/2018|1:00| 9.1| 8.9| 9.8|
// +---+---------+----+--------------+-----------+--------------+
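The question also asks for the Date and Time columns to be merged, like 12/3/2018 0:30. A minimal follow-up sketch, assuming the pivoted DataFrame above has first been assigned to a val named result (a name not used in the original answer):

import org.apache.spark.sql.functions.{col, concat_ws}

// Assumption: `result` holds the pivoted DataFrame built above
// (i.e. the whole pipeline before `.show` assigned to a val).
val merged = result.
  withColumn("DateTime", concat_ws(" ", col("Date"), col("Time"))).  // e.g. "12/3/2018 0:30"
  drop("Date", "Time")

merged.show(false)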