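I want to create a JSON document from two dataframes (one is the parent, the other is the child). The child records should form an array nested inside the parent record.
Df1 (department):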
+----------+------------+
| dept_Id | dept_name |
+----------+------------+
| 10 | Sales |
+----------+------------+
Df2 (employee):
+----------+--------+----------+
| dept_Id | emp_id | emp_name |
+----------+--------+----------+
| 10 | 1001 | John |
| 10 | 1002 | Rich |
+----------+--------+----------+
I would like the JSON to look like this:
{
"dept_id":"10",
"dept_name":"Sales",
"employee":[
{ "emp_id":"1001","emp_name":"John" },
{ "emp_id":"1002","emp_name":"Rich" }
]
}
I would appreciate your thoughts. Thanks.
Answer 0: (score: 1)
First, join the two dataframes on dept_Id:
val df = df1.join(df2, Seq("dept_Id"))
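Then use groupBy and collect_list to gather each department's employees into an array. Two case classes are used here to get the correct names in the final JSON. These should be defined outside the main method.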
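import org.apache.spark.sql.functions.{collect_list, struct}
import spark.implicits._  // assumes a SparkSession named spark; needed for $"..." and .as[Department]

case class Department(dept_Id: Int, dept_name: String, employee: Seq[Employee])
case class Employee(emp_id: Int, emp_name: String)

// One row per department, with its employees collected into an array of structs
val dfDept = df.groupBy("dept_id", "dept_name")
  .agg(collect_list(struct($"emp_id", $"emp_name")).as("employee"))
  .as[Department]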
Resulting dataframe:
+-------+---------+--------------------------+
|dept_id|dept_name|employee |
+-------+---------+--------------------------+
|10 |Sales |[[1002,Rich], [1001,John]]|
+-------+---------+--------------------------+
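Finally, save it as JSON. Spark treats the path as a directory and writes the part file(s) inside it; coalesce(1) keeps this to a single part file:
dfDept.coalesce(1).write.json("department.json")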
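To try the steps above end to end without writing any files, here is a minimal sketch. It assumes a local SparkSession and builds the two sample tables from the question in memory; the object name NestedJsonSketch, the app name, and the use of toJSON for inspection are illustrative choices, not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, struct}

object NestedJsonSketch {
  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (an assumption, not from the answer)
    val spark = SparkSession.builder()
      .appName("nested-json-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Sample data mirroring the two tables in the question
    val df1 = Seq((10, "Sales")).toDF("dept_id", "dept_name")
    val df2 = Seq((10, 1001, "John"), (10, 1002, "Rich"))
      .toDF("dept_id", "emp_id", "emp_name")

    // Join parent and child, then fold each department's employees
    // into an array of structs named "employee"
    val nested = df1.join(df2, Seq("dept_id"))
      .groupBy("dept_id", "dept_name")
      .agg(collect_list(struct($"emp_id", $"emp_name")).as("employee"))

    // toJSON yields one JSON string per row; print them to inspect the shape
    nested.toJSON.collect().foreach(println)

    spark.stop()
  }
}
```

Two things to keep in mind when comparing the output with the JSON in the question: collect_list does not guarantee the order of the collected elements (the result table above shows Rich before John), and integer columns are emitted as JSON numbers. If the ids must be strings, casting the columns with .cast("string") before aggregating would be one option.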