Spark 2.0 DataFrame: collect multiple rows into one row, column by column

Date: 2016-11-27 09:52:42

Tags: scala apache-spark

I have a DataFrame like the one below. When the key columns have the same values, I want to collapse the multiple rows into a single row:

import spark.implicits._

val data = Seq(("a","b","sum",0),("a","b","avg",2))
  .toDF("id1","id2","type","value2")
data.show
    +---+---+----+------+
    |id1|id2|type|value2|
    +---+---+----+------+
    |  a|  b| sum|     0|
    |  a|  b| avg|     2|
    +---+---+----+------+

I want to convert it to:

+---+---+----+------+
|id1|id2|agg |value2|
+---+---+----+------+
|  a|  b| 0,2|     0|
+---+---+----+------+

printSchema should look like this:

root
 |-- id1: string (nullable = true)
 |-- id2: string (nullable = true)
 |-- agg: struct (nullable = true)
 |    |-- sum: int (nullable = true)
 |    |-- avg: int (nullable = true)

1 Answer:

Answer 0 (score: 1)

You can group by the key columns and aggregate the per-type values into a struct:

import org.apache.spark.sql.functions._
import spark.implicits._

val data = Seq(
  ("a","b","sum",0),("a","b","avg",2)
).toDF("id1","id2","type","value2")

// For each (id1, id2) group, pick out the value2 belonging to each type
// and pack the results into a single struct column named "agg".
// first(..., true) ignores the nulls produced by the non-matching rows.
val result = data.groupBy($"id1", $"id2").agg(struct(
  first(when($"type" === "sum", $"value2"), true).alias("sum"),
  first(when($"type" === "avg", $"value2"), true).alias("avg")
).alias("agg"))

result.show

+---+---+-----+
|id1|id2|  agg|
+---+---+-----+
|  a|  b|[0,2]|
+---+---+-----+

result.printSchema
root
 |-- id1: string (nullable = true)
 |-- id2: string (nullable = true)
 |-- agg: struct (nullable = false)
 |    |-- sum: integer (nullable = true)
 |    |-- avg: integer (nullable = true)
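
If you would rather have `sum` and `avg` as ordinary top-level columns instead of fields nested in a struct, a `pivot` is a possible variant. This is an extra sketch, not part of the answer above; it assumes the same `data` DataFrame and imports as before:

```scala
// Alternative sketch: pivot on `type` so that sum/avg become
// top-level columns rather than fields of a struct column.
// Passing the expected values to pivot() avoids an extra pass
// over the data to discover the distinct types.
val pivoted = data.groupBy($"id1", $"id2")
  .pivot("type", Seq("sum", "avg"))
  .agg(first($"value2"))
```

The resulting schema is flat (`id1`, `id2`, `sum`, `avg`), which is often easier to query, at the cost of losing the single `agg` column the question asked for.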