Spark: select the sum and all columns of joined datasets

Date: 2018-10-05 14:58:40

Tags: scala apache-spark aggregate-functions

I have two tables: Employees (Id, Name) and EmployeeSalary (EmployeeId, Designation, Salary). An employee can hold multiple designations in the company and therefore have multiple salaries. How can I get the EmployeeId, the Name, the sum of the salaries, and a Seq of all designations?

What I have tried so far:

    employeeDS.join(employeeSalaryDS, employeeDS.col("Id")
        .equalTo(employeeSalaryDS.col("EmployeeId")), "left_outer")
      .groupBy(employeeDS.col("Id")).agg(sum("Salary") as "Sum of salaries")
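The result being asked for — one row per employee carrying the name, the salary total, and a Seq of designations — can be sketched on plain Scala collections to pin down the intended semantics (hypothetical `Employee`/`EmployeeSalary` case classes, not the Spark API):

```scala
// Hypothetical in-memory stand-ins for the two tables.
case class Employee(id: Int, name: String)
case class EmployeeSalary(employeeId: Int, desig: String, salary: Int)

object Sketch {
  // Group salary rows by employee, then emit (id, name, salary sum, designations).
  def summarize(emps: Seq[Employee],
                sals: Seq[EmployeeSalary]): Seq[(Int, String, Int, Seq[String])] = {
    val byEmp = sals.groupBy(_.employeeId)
    emps.map { e =>
      // An employee with no salary rows still appears (left-outer behavior).
      val rows = byEmp.getOrElse(e.id, Seq.empty)
      (e.id, e.name, rows.map(_.salary).sum, rows.map(_.desig))
    }
  }
}
```

In Spark terms this corresponds to grouping by both `Id` and `Name` and combining `sum` with `collect_list` in a single `agg`.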

1 Answer:

Answer 0 (score: 1)

Something like this:

scala> val dfe = Seq((101,"John"),(102,"Mike"), (103,"Paul"), (104,"Tom")).toDF("id","name")
dfe: org.apache.spark.sql.DataFrame = [id: int, name: string]

scala> val dfes = Seq((101,"Dev", 4000),(102,"Designer", 4000),(102,"Architect", 5000), (103,"Designer",6000), (104,"Consultant",8000), (104,"Supervisor",9000), (104,"PM",10000) ).toDF("id","desig","salary")
dfes: org.apache.spark.sql.DataFrame = [id: int, desig: string ... 1 more field]

scala> dfe.join(dfes, dfe.col("id").equalTo(dfes.col("id")),"left_outer").groupBy(dfe.col("Id")).agg(sum("Salary") as "Sum of salaries", collect_list('desig) as "desig_list").show(false)
+---+---------------+----------------------------+
|Id |Sum of salaries|desig_list                  |
+---+---------------+----------------------------+
|101|4000           |[Dev]                       |
|103|6000           |[Designer]                  |
|102|9000           |[Architect, Designer]       |
|104|27000          |[PM, Supervisor, Consultant]|
+---+---------------+----------------------------+


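One case the sample data does not exercise: with `left_outer`, an employee who has no salary rows at all still appears in the result (in Spark the sum would be null and the collected list empty). The effect can be sketched on plain Scala collections (hypothetical data; "Anna" has no salary rows):

```scala
object LeftOuterSketch {
  // Left-outer semantics: every employee keeps a row; unmatched ones get an empty group.
  def summarize(emps: Seq[(Int, String)],
                sals: Seq[(Int, String, Int)]): Map[Int, (Int, Seq[String])] = {
    val grouped = sals.groupBy(_._1)
    emps.map { case (id, _) =>
      val rows = grouped.getOrElse(id, Seq.empty)
      (id, (rows.map(_._3).sum, rows.map(_._2)))
    }.toMap
  }
}
```

An inner join, by contrast, would drop such employees entirely, which is why the question's `left_outer` choice matters.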