I have two tables in Hive:
emp(empid int,empname string,deptid string)
dept(deptid string, deptname string)
Sample data:
The emp table in Hive has schema (empid int, empname string, deptid string):
1,Monami Sen,D01
2,Tarun Sen,D02
3,Shovik Sen,D03
4,Rita Roy,D02
5,Farhan,D01
The dept table in Hive has schema (deptid string, deptname string):
D01,Finance
D02,IT
D03,Accounts
D04,Admin
I need to create another Hive table with the following schema:
deptid string, deptname string, empdetails array<struct<empid:int,empname:string>>
The array of structs should hold the details (empid and empname) of every employee belonging to the given department, and the final DataFrame should be converted to JSON format.
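For reference, the DDL for such a target table in Hive might look like the sketch below (the table name dept_emp is a placeholder, and the storage format is left at the default):

-- a minimal sketch; dept_emp is a hypothetical table name
create table dept_emp (
  deptid string,
  deptname string,
  empdetails array<struct<empid:int,empname:string>>
);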
Desired output:
{"deptid":"D01","deptname":"IT","empdetails":[{"empid":1,"empname":"Monami Sen"}]}
{"deptid":"D02","deptname":"Accounts","empdetails":[{"empid":2,"empname":"Rita Roy"},
{"empid":5,"empname":"Rijul Shah"}]}
{"deptid":"D03","deptname":"Finance","empdetails":[{"empid":3,"empname":"Shovik Sen"},{"empid":4,"empname":"Arghya Ghosh"}]}
{"deptid":"D04","deptname":"Adminstration","empdetails":[]}
I need to code this with Spark 1.6 and Scala 2.10. The datasets are very large, so I need efficient code to get the best performance.
Could you give me any suggestions for the code?
Answer 0 (score: 0)
I would suggest performing a left_outer join, followed by a groupBy/collect_list aggregation and a toJSON conversion, as shown below:
// assumes Spark 2.x (collect_list over structs in the DataFrame API); see the Spark 1.6 variant below
import org.apache.spark.sql.functions._  // collect_list, struct
import spark.implicits._                 // toDF and $-notation (spark is the SparkSession)

val empDF = Seq(
  (1, "Monami Sen", "D01"),
  (2, "Tarun Sen", "D02"),
  (3, "Shovik Sen", "D03"),
  (4, "Rita Roy", "D02"),
  (5, "Farhan", "D01")
).toDF("empid", "empname", "deptid")

val deptDF = Seq(
  ("D01", "Finance"),
  ("D02", "IT"),
  ("D03", "Accounts"),
  ("D04", "Admin")
).toDF("deptid", "deptname")

// left_outer keeps departments without employees (e.g. D04)
deptDF.join(empDF, Seq("deptid"), "left_outer").
  groupBy("deptid", "deptname").
  agg(collect_list(struct($"empid", $"empname")).as("empdetails")).
  toJSON.
  show(false)
// +----------------------------------------------------------------------------------------------------------------------+
// |value |
// +----------------------------------------------------------------------------------------------------------------------+
// |{"deptid":"D03","deptname":"Accounts","empdetails":[{"empid":3,"empname":"Shovik Sen"}]} |
// |{"deptid":"D02","deptname":"IT","empdetails":[{"empid":4,"empname":"Rita Roy"},{"empid":2,"empname":"Tarun Sen"}]} |
// |{"deptid":"D01","deptname":"Finance","empdetails":[{"empid":5,"empname":"Farhan"},{"empid":1,"empname":"Monami Sen"}]}|
// |{"deptid":"D04","deptname":"Admin","empdetails":[{}]} |
// +----------------------------------------------------------------------------------------------------------------------+
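To persist the result rather than display it, the aggregation can be assigned to a val and the JSON strings written out; a minimal sketch, where the name aggDF and the output path are placeholders:

val aggDF = deptDF.join(empDF, Seq("deptid"), "left_outer").
  groupBy("deptid", "deptname").
  agg(collect_list(struct($"empid", $"empname")).as("empdetails"))

aggDF.toJSON.write.text("/output/path/dept_emp_json")        // Spark 2.x: toJSON is a Dataset[String]
// aggDF.toJSON.saveAsTextFile("/output/path/dept_emp_json") // Spark 1.6: toJSON is an RDD[String]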
For Spark 1.6, consider doing the aggregation via Spark SQL instead (since collect_list does not appear to support non-primitive field types in the Spark 1.6 DataFrame API):
// Spark 1.6: use registerTempTable (createOrReplaceTempView only exists from Spark 2.0);
// collect_list in SQL requires a HiveContext
deptDF.join(empDF, Seq("deptid"), "left_outer").
  registerTempTable("joined_table")

val resultDF = sqlContext.sql("""
  select deptid, deptname, collect_list(struct(empid, empname)) as empdetails
  from joined_table
  group by deptid, deptname
""")

// in Spark 1.6, toJSON returns an RDD[String], which has no show method
resultDF.toJSON.take(10).foreach(println)
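Two further notes. First, since the data already lives in Hive, the test DataFrames above can be replaced with direct table reads, e.g. val empDF = sqlContext.table("emp"). Second, with the left_outer join a department with no employees (D04) collects one struct of NULL fields, which serializes as [{}] rather than the desired []; a sketch of one workaround, relying on collect_list skipping NULL inputs (worth verifying on your Hive/Spark version), is to NULL out the struct for unmatched rows:

// a sketch: the CASE yields NULL for departments with no employees,
// and collect_list drops NULL entries, leaving an empty array for D04
val resultDF = sqlContext.sql("""
  select deptid, deptname,
         collect_list(case when empid is not null then struct(empid, empname) end) as empdetails
  from joined_table
  group by deptid, deptname
""")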