Question

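My DataFrame contains rows that share the same id. I need to merge all rows with the same id into a single row (one JSON object).

Here is a sample of the data: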

id  first_name   last_name
1    JAMES         SMITH
2    MARY          BROWN
2    DAVID         WILLIAMS
1    ROBERT        DAVIS

The desired result is:

{
  id:1,
  entities: [{
    first_name:JAMES,
    last_name:SMITH 
   }, {
    first_name:ROBERT,
    last_name:DAVIS
  }]
}
{
  id:2,
  entities: [{
    first_name:MARY,
    last_name:BROWN 
   }, {
    first_name:DAVID,
    last_name:WILLIAMS
  }]
}

Is this possible?

Regards, Yaniv

Answer 1

You can use groupBy and collect_list after "merging" the relevant columns into a single nested struct:

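import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
// Assumes a SparkSession named `spark` is in scope (as in spark-shell);
// needed for toDF and the $"col" syntax.
import spark.implicits._

val input: DataFrame = Seq(
  (1, "JAMES", "SMITH"),
  (2, "MARY", "BROWN"),
  (2, "DAVID", "WILLIAMS"),
  (1, "ROBERT", "DAVIS")
).toDF("id", "first_name", "last_name")

val result = input
  // Pack both name columns into one struct column
  .withColumn("entity", struct($"first_name", $"last_name"))
  // Collect each id's structs into an array; the alias gives the column
  // the name "entities" instead of the default "collect_list(entity)"
  .groupBy("id").agg(collect_list($"entity").as("entities"))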

result.show(false)
// +---+--------------------------------+
// |id |entities                        |
// +---+--------------------------------+
// |1  |[[JAMES,SMITH], [ROBERT,DAVIS]] |
// |2  |[[MARY,BROWN], [DAVID,WILLIAMS]]|
// +---+--------------------------------+

result.printSchema()
// root
//  |-- id: integer (nullable = false)
//  |-- entities: array (nullable = true)
//  |    |-- element: struct (containsNull = true)
//  |    |    |-- first_name: string (nullable = true)
//  |    |    |-- last_name: string (nullable = true)
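
The question asks for one JSON document per id, so the nested rows still need to be serialized. The following is a minimal, self-contained sketch of one way to do that with Dataset.toJSON. The object name, app name, and local master setting are assumptions made for illustration; they are not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, struct}

// Hypothetical wrapper object, for illustration only
object MergeRowsToJson {
  def main(args: Array[String]): Unit = {
    // Local session; app name and master are assumptions for this sketch
    val spark = SparkSession.builder()
      .appName("merge-rows-to-json")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val input = Seq(
      (1, "JAMES", "SMITH"),
      (2, "MARY", "BROWN"),
      (2, "DAVID", "WILLIAMS"),
      (1, "ROBERT", "DAVIS")
    ).toDF("id", "first_name", "last_name")

    // Same grouping as in the answer: one array of structs per id
    val result = input
      .groupBy("id")
      .agg(collect_list(struct($"first_name", $"last_name")).as("entities"))

    // toJSON turns every row into a single JSON string
    result.toJSON.collect().foreach(println)

    spark.stop()
  }
}
```

Each printed line should be a JSON object with an id field and an entities array. Neither the order of the ids nor the order of the elements inside entities is guaranteed after the shuffle. To write files instead of printing, result.write.json(path) produces the same one-object-per-line format.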

Spark - Merge DataFrame rows into one row