Is there a way to add a trailing column to a pivoted DataFrame?

Asked: 2019-06-12 20:22:10

Tags: scala apache-spark pivot

Suppose I have the following DataFrame:

import spark.implicits._ // needed for toDF and the $"col" syntax (pre-imported in spark-shell)

val df = spark.sparkContext.parallelize(Seq(
        ("A", "12", 50),
        ("A", "13", 100),
        ("A", "14", 30),
        ("B", "15", 40),
        ("C", "16", 60),
        ("C", "17", 70)
      )).toDF("Name", "Time", "Value")

Then I pivot on "Time":

import org.apache.spark.sql.functions.{coalesce, lit, sum}

val pivoted = df.groupBy($"Name").
    pivot("Time").
    agg(coalesce(sum($"Value"), lit(0)))

pivoted.show()

The result is:

+----+----+----+----+----+----+----+
|Name|  12|  13|  14|  15|  16|  17|
+----+----+----+----+----+----+----+
|   B|null|null|null|  40|null|null|
|   C|null|null|null|null|  60|  70|
|   A|  50| 100|  30|null|null|null|
+----+----+----+----+----+----+----+

Up to this point everything is fine. What I want is to add a column next to the "17" column that holds the sum of each row, so the expected output would be:

+----+----+----+----+----+----+----+----+
|Name|  12|  13|  14|  15|  16|  17| sum|
+----+----+----+----+----+----+----+----+
|   B|null|null|null|  40|null|null|  40|
|   C|null|null|null|null|  60|  70| 130|
|   A|  50| 100|  30|null|null|null| 180|
+----+----+----+----+----+----+----+----+

(Noobishly,) I tried adding a "withColumn", but it failed:

// Fails: after the pivot there is no "Value" column left to reference,
// and sum is an aggregate, so it cannot be used in withColumn without a window.
val pivotedWithSummation = df.groupBy($"Name").
    pivot("Time").
    agg(coalesce(sum($"Value"),lit(0))).
    withColumn("summation", sum($"Value"))

I came across this answer, but I couldn't manage to apply it :/

I'm using Scala v2.11.8 and Spark 2.3.1.

Thanks!

1 answer:

Answer 0: (score: 1)

Get the sum of the values from the original input DataFrame, then join it with the pivoted DataFrame:

scala> val pivoted = df.groupBy($"Name").pivot("Time").agg(coalesce(sum($"Value"),lit(0)))
pivoted: org.apache.spark.sql.DataFrame = [Name: string, 12: bigint ... 5 more fields]

scala> pivoted.show
+----+----+----+----+----+----+----+
|Name|  12|  13|  14|  15|  16|  17|
+----+----+----+----+----+----+----+
|   B|null|null|null|  40|null|null|
|   C|null|null|null|null|  60|  70|
|   A|  50| 100|  30|null|null|null|
+----+----+----+----+----+----+----+


scala> val sumOfValuesDF = df.groupBy($"Name").sum("value")
sumOfValuesDF: org.apache.spark.sql.DataFrame = [Name: string, sum(value): bigint]

scala> sumOfValuesDF.show
+----+----------+
|Name|sum(value)|
+----+----------+
|   B|        40|
|   C|       130|
|   A|       180|
+----+----------+


scala> val pivotedWithSummation = pivoted.join(sumOfValuesDF, "Name")
pivotedWithSummation: org.apache.spark.sql.DataFrame = [Name: string, 12: bigint ... 6 more fields]

scala> pivotedWithSummation.show
+----+----+----+----+----+----+----+----------+
|Name|  12|  13|  14|  15|  16|  17|sum(value)|
+----+----+----+----+----+----+----+----------+
|   B|null|null|null|  40|null|null|        40|
|   C|null|null|null|null|  60|  70|       130|
|   A|  50| 100|  30|null|null|null|       180|
+----+----+----+----+----+----+----+----------+
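
Note that the joined column comes out as "sum(value)" rather than "sum". To match the expected output, rename it with withColumnRenamed("sum(value)", "sum"), or alias the aggregate up front: df.groupBy($"Name").agg(sum($"Value").as("sum")).

If you'd rather skip the second aggregation and the join, here is a minimal sketch that computes the total row-wise from the pivoted columns themselves (it assumes every column other than "Name" is a pivoted Time value; pivotedWithSum is just an illustrative name):

import org.apache.spark.sql.functions.{coalesce, col, lit}

// All pivoted Time columns (everything except the grouping key).
val valueCols = pivoted.columns.filterNot(_ == "Name")

// Row-wise total: treat null cells as 0, then add the columns together.
val rowTotal = valueCols
  .map(c => coalesce(col(c), lit(0L)))
  .reduce(_ + _)

val pivotedWithSum = pivoted.withColumn("sum", rowTotal)
pivotedWithSum.show()

Both approaches give the same totals; the join version scans the input data a second time, while the row-wise version operates on the already-pivoted result.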