Spark - merging DataFrame rows into a single row

Asked: 2016-10-27 13:05:37

Tags: scala apache-spark dataframe apache-spark-sql

My DataFrame contains rows that share the same id. I need to merge all rows with the same id into a single row (one JSON document).

Here is a sample of the data:

id  first_name   last_name
1    JAMES         SMITH
2    MARY          BROWN
2    DAVID         WILLIAMS
1    ROBERT        DAVIS

The requested result is:

{
  "id": 1,
  "entities": [{
    "first_name": "JAMES",
    "last_name": "SMITH"
  }, {
    "first_name": "ROBERT",
    "last_name": "DAVIS"
  }]
}
{
  "id": 2,
  "entities": [{
    "first_name": "MARY",
    "last_name": "BROWN"
  }, {
    "first_name": "DAVID",
    "last_name": "WILLIAMS"
  }]
}

Is that possible?

Regards, Yaniv

1 answer:

Answer 0 (score: 1)

You can use `groupBy` with `collect_list` after "merging" the relevant columns into a single nested struct:

import org.apache.spark.sql.DataFrame
import spark.implicits._ // for toDF and the $-column syntax

val input: DataFrame = Seq(
  (1, "JAMES", "SMITH"),
  (2, "MARY", "BROWN"),
  (2, "DAVID", "WILLIAMS"),
  (1, "ROBERT", "DAVIS")
).toDF("id", "first_name", "last_name")

import org.apache.spark.sql.functions._

val result = input
  .withColumn("entity", struct($"first_name", $"last_name"))
  .groupBy("id")
  .agg(collect_list($"entity") as "entities") // alias so the array column is named "entities"

result.show(false)
// +---+--------------------------------+
// |id |entities                        |
// +---+--------------------------------+
// |1  |[[JAMES,SMITH], [ROBERT,DAVIS]] |
// |2  |[[MARY,BROWN], [DAVID,WILLIAMS]]|
// +---+--------------------------------+

result.printSchema()
// root
//  |-- id: integer (nullable = false)
//  |-- entities: array (nullable = true)
//  |    |-- element: struct (containsNull = true)
//  |    |    |-- first_name: string (nullable = true)
//  |    |    |-- last_name: string (nullable = true)
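
Since the question asks for one JSON document per id, the grouped result can be serialized directly. A minimal sketch, assuming a running `SparkSession` named `spark` and the `result` DataFrame from above (the output path is only illustrative):

```scala
// Render each grouped row as a JSON string, one document per id.
// toJSON returns a Dataset[String]; row order is not guaranteed.
val jsonDocs = result.toJSON
jsonDocs.show(false)

// Alternatively, write the whole DataFrame out as newline-delimited JSON files:
result.write.mode("overwrite").json("/tmp/merged_by_id") // example path
```

`toJSON` nests the `entities` array of structs exactly as in the printed schema, so each output line matches the shape requested in the question.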