Create JSON by joining DataFrames (parent and child)

Time: 2018-09-06 02:57:02

Tags: json scala apache-spark apache-spark-sql

I want to create a JSON document from 2 DataFrames (one parent, the other child). The child records should form an array, producing a nested JSON.

Df1 (Department):

+----------+------------+
| dept_Id  | dept_name  |
+----------+------------+
| 10       | Sales      |
+----------+------------+

Df2 (Employee):

+----------+--------+----------+
| dept_Id  | emp_id | emp_name |
+----------+--------+----------+
| 10       | 1001   | John     |
| 10       | 1002   | Rich     |
+----------+--------+----------+

I would like the JSON to be created as follows:

{
 "dept_id":"10",
 "dept_name":"Sales",
 "employee":[ 
        { "emp_id":"1001","emp_name":"John" },
        { "emp_id":"1002","emp_name":"Rich" }
   ]
}

Appreciate your thoughts. Thanks.

1 answer:

Answer 0: (score: 1)

First, join the two DataFrames together:

val df = df1.join(df2, Seq("dept_Id"))
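
For reference, here is a minimal sketch of how the two example DataFrames from the question could be built locally (the SparkSession name spark is an assumption):

import spark.implicits._ // enables toDF on local collections

// Hypothetical setup mirroring the tables in the question
val df1 = Seq((10, "Sales")).toDF("dept_Id", "dept_name")
val df2 = Seq((10, 1001, "John"), (10, 1002, "Rich"))
  .toDF("dept_Id", "emp_id", "emp_name")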

Then use groupBy and collect_list. Two case classes are used here to get the correct field names in the final JSON; they should be declared outside the main method (at top level) so Spark can derive encoders for them.

import org.apache.spark.sql.functions.{collect_list, struct}
import spark.implicits._ // for the $"..." syntax and the Department encoder

case class Department(dept_Id: Int, dept_name: String, employee: Seq[Employee])
case class Employee(emp_id: Int, emp_name: String)

// Collect each department's employees into an array of structs;
// Spark resolves "dept_id" against the joined column "dept_Id"
// case-insensitively by default.
val dfDept = df.groupBy("dept_id", "dept_name")
  .agg(collect_list(struct($"emp_id", $"emp_name")).as("employee"))
  .as[Department]

The resulting DataFrame:

+-------+---------+--------------------------+
|dept_id|dept_name|employee                  |
+-------+---------+--------------------------+
|10     |Sales    |[[1002,Rich], [1001,John]]|
+-------+---------+--------------------------+
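
Before writing, you can confirm the nested structure with printSchema; the employee column should show up as an array of structs:

// Expect: dept_id, dept_name, and employee as array<struct<emp_id, emp_name>>
dfDept.printSchema()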

Finally, save it as a JSON file:

dfDept.coalesce(1).write.json("department.json")
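
Note that write.json creates a directory named department.json containing a part file, not a single flat file; coalesce(1) only guarantees a single part file inside that directory. For small results, one alternative sketch is to collect the per-row JSON strings on the driver:

// Only safe when the result comfortably fits in driver memory
val jsonLines: Array[String] = dfDept.toJSON.collect()
jsonLines.foreach(println)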