Creating an array of struct data type by joining two Hive tables

Time: 2018-10-30 16:45:05

Tags: scala apache-spark hive bigdata

I have two tables in Hive:

 emp(empid int,empname string,deptid string)
 dept(deptid string, deptname string)

Sample data:

The emp table in Hive has the schema empid int, empname string, deptid string:

 1,Monami Sen,D01
 2,Tarun Sen,D02
 3,Shovik Sen,D03
 4,Rita Roy,D02
 5,Farhan,D01

The dept table in Hive has the schema deptid string, deptname string:

 D01,Finance
 D02,IT
 D03,Accounts
 D04,Admin

I need to create another Hive table with the following schema:

 deptid string, deptname string, empdetails array<struct<empid:int,empname:string>>

The array of structs should contain the details (empid and empname) of every employee belonging to that department, and the final dataframe should be converted to JSON format.
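
For reference, a minimal sketch of the target table's DDL, issued through a HiveContext in Scala (the table name dept_emp and the Parquet storage format are assumptions, not given in the question):

 sqlContext.sql("""
   create table if not exists dept_emp (
     deptid string,
     deptname string,
     empdetails array<struct<empid:int,empname:string>>
   )
   stored as parquet
 """)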

Desired output:

{"deptid":"D01","deptname":"IT","empdetails":[{"empid":1,"empname":"Monami Sen"}]}
{"deptid":"D02","deptname":"Accounts","empdetails":[{"empid":2,"empname":"Rita Roy"}, 
{"empid":5,"empname":"Rijul Shah"}]}
{"deptid":"D03","deptname":"Finance","empdetails":[{"empid":3,"empname":"Shovik Sen"},{"empid":4,"empname":"Arghya Ghosh"}]}
{"deptid":"D04","deptname":"Adminstration","empdetails":[]}

I need to code this on Spark 1.6 with Scala 2.10. The datasets are very large, so the code needs to be efficient to get the best performance.

Could you offer any suggestions on the code?

1 Answer:

Answer 0 (score: 0)

I would suggest performing a left_outer join, followed by a groupBy/collect_list aggregation and a toJSON conversion, like below:

import org.apache.spark.sql.functions._  // collect_list, struct, when
import sqlContext.implicits._            // toDF and $-syntax (use spark.implicits._ on Spark 2.x)

val empDF = Seq(
  (1, "Monami Sen", "D01"),
  (2, "Tarun Sen", "D02"),
  (3, "Shovik Sen", "D03"),
  (4, "Rita Roy", "D02"),
  (5, "Farhan", "D01")
).toDF("empid", "empname", "deptid")

val deptDF = Seq(
  ("D01", "Finance"),
  ("D02", "IT"),
  ("D03", "Accounts"),
  ("D04", "Admin")
).toDF("deptid", "deptname")

deptDF.join(empDF, Seq("deptid"), "left_outer").
  groupBy("deptid", "deptname").
  agg(collect_list(struct($"empid", $"empname")).as("empdetails")).
  toJSON.
  show(false)
// +----------------------------------------------------------------------------------------------------------------------+
// |value                                                                                                                 |
// +----------------------------------------------------------------------------------------------------------------------+
// |{"deptid":"D03","deptname":"Accounts","empdetails":[{"empid":3,"empname":"Shovik Sen"}]}                              |
// |{"deptid":"D02","deptname":"IT","empdetails":[{"empid":4,"empname":"Rita Roy"},{"empid":2,"empname":"Tarun Sen"}]}    |
// |{"deptid":"D01","deptname":"Finance","empdetails":[{"empid":5,"empname":"Farhan"},{"empid":1,"empname":"Monami Sen"}]}|
// |{"deptid":"D04","deptname":"Admin","empdetails":[{}]}                                                                 |
// +----------------------------------------------------------------------------------------------------------------------+
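
Note that for the empty department D04 the JSON comes out as [{}] rather than the desired [], because the left_outer join yields a row of nulls whose struct is itself non-null and is therefore still collected. A sketch of one way around this: wrap the struct in a when so that unmatched rows produce null, which collect_list skips:

deptDF.join(empDF, Seq("deptid"), "left_outer").
  groupBy("deptid", "deptname").
  agg(collect_list(when($"empid".isNotNull, struct($"empid", $"empname"))).as("empdetails")).
  toJSON.
  show(false)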

For Spark 1.6, consider performing the aggregation via Spark SQL instead (since collect_list in the Spark 1.6 DataFrame API does not appear to support non-primitive field types):

deptDF.join(empDF, Seq("deptid"), "left_outer").
  registerTempTable("joined_table")  // createOrReplaceTempView is the Spark 2.x equivalent

// in Spark 1.6, collect_list is backed by the Hive UDAF, so sqlContext must be a HiveContext
val resultDF = sqlContext.sql("""
  select deptid, deptname, collect_list(struct(empid, empname)) as empdetails
  from joined_table
  group by deptid, deptname
""")

// in Spark 1.6, toJSON returns an RDD[String] (which has no show method), so print a sample instead
resultDF.toJSON.take(10).foreach(println)
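
To persist the JSON output rather than just display it, the RDD of JSON strings can be written out directly, or the aggregated DataFrame can be saved as the target Hive table (the path and table name below are placeholders):

resultDF.toJSON.saveAsTextFile("/tmp/dept_emp_json")       // hypothetical output path
resultDF.write.format("parquet").saveAsTable("dept_emp")   // hypothetical table name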