Multiple GROUP BY and multiple displayed counts in Spark SQL?

Time: 2019-01-04 10:46:45

Tags: sql apache-spark apache-spark-sql

I'm new to Spark and have a question about Spark SQL. Consider this EMPLOYEE table:

Employee     Sub_department   Department 
A               105182          10
A               105182          10   (rows can be duplicated!)
A               114256          11
A               127855          12
A               125182          12
B               136234          13
B               133468          13

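Department is defined as substring(sub_department,0,2), which extracts only the first two digits of sub_department.

What I want to display is a division of employees into 3 types:

  • Set 1: employees with at least 3 distinct departments (regardless of their sub-departments)
  • Set 2: employees with at least 5 distinct sub-departments and 2 distinct departments
  • Set 3: employees with at least 10 distinct sub-departments within the same department

I don't know how to do this even in classic SQL. But at least I think the final output could look like this: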
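One way to read these three definitions, as a minimal sketch rather than a definitive answer: compute both distinct counts per employee in a single GROUP BY, then derive one flag per set. The table name employee_table and the flag columns in_set_1, in_set_2 and in_set_3 are assumptions for illustration, as is reading "2 distinct departments" as exactly 2 and "the same department" as exactly 1.

```sql
-- Sketch only: employee_table and the in_set_* column names are assumed.
SELECT
  employee,
  total_sub_dept,
  total_dept,
  total_dept >= 3                          AS in_set_1,
  total_sub_dept >= 5  AND total_dept = 2  AS in_set_2,  -- assumes "exactly 2 departments"
  total_sub_dept >= 10 AND total_dept = 1  AS in_set_3   -- assumes "same department" means exactly 1
FROM (
  -- COUNT(DISTINCT ...) ignores the duplicated rows, so no prior de-duplication is needed
  SELECT
    employee,
    COUNT(DISTINCT sub_department)                  AS total_sub_dept,
    COUNT(DISTINCT SUBSTRING(sub_department, 0, 2)) AS total_dept
  FROM employee_table
  GROUP BY employee
) counts
```

On the sample table, employee A has 4 distinct sub-departments and 3 distinct departments, so under these assumptions only in_set_1 is true for A.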

Employee     Sub_department   total_sub_dept  Department  total_dept 
A               105182          4                10           3     
A               114256          4                11           3
A               127855          4                12           3
A               125182          4                12           3

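A final, optional column named "Set" would show which set an employee belongs to, but I'm afraid computing such a value would be too expensive...

For both columns (sub_department and Department), it is important to display the distinct values together with their counts.

I have a very large table (many columns and a lot of potentially redundant data), so my idea was to do a first partition on sub_department and store the result in a first table, then a second partition on Department (regardless of the sub_department value) stored in a second table, and finally an inner join between the two tables on the employee name.

But I got some wrong results, and I don't know whether there is a better way to do this, or at least a way to optimize it, since the Department column depends on sub_department (so one grouping should be enough instead of two).

So how can I solve this? I tried, but it seems impossible to combine count(column) with the column itself for each of the two columns...

Thanks in advance.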
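Because Department is derived from sub_department, both counts can come from one aggregation over the same column, joined back once to the detail rows. The following is a sketch under the assumption that the table is called employee_table; the aliases e and c are illustrative, and it has not been benchmarked on a large table.

```sql
-- Sketch only: employee_table, e and c are assumed names.
SELECT DISTINCT                      -- DISTINCT drops the duplicated detail rows
  e.employee,
  e.sub_department,
  c.total_sub_dept,
  SUBSTRING(e.sub_department, 0, 2) AS department,
  c.total_dept
FROM employee_table e
JOIN (
  -- a single grouping per employee yields both distinct counts
  SELECT
    employee,
    COUNT(DISTINCT sub_department)                  AS total_sub_dept,
    COUNT(DISTINCT SUBSTRING(sub_department, 0, 2)) AS total_dept
  FROM employee_table
  GROUP BY employee
) c
  ON e.employee = c.employee
```

This replaces the two intermediate tables with one aggregated subquery, so only one join on employee remains.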

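2 Answers:

Answer 0: (score: 0)

Salman, people on SO will downvote if you don't post the code you have tried so far. I'll help with the requirement for set 1, just to encourage you. Try to understand the query below; once you do, sets 2 and 3 are straightforward.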

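SELECT
 employee,
 total_dept
FROM
(
 SELECT
  employee,
  COUNT(Department) AS total_dept
 FROM
 (
  -- number the rows within each (employee, department) pair
  SELECT
    employee,
    Sub_department,
    SUBSTRING(Sub_department,0,2) AS Department,
    ROW_NUMBER() OVER (PARTITION BY employee, SUBSTRING(Sub_department,0,2) ORDER BY Sub_department) AS redundancy
  FROM
  employee_table
 ) numbered
 -- keep one row per pair so duplicates are not counted
 WHERE redundancy = 1
 GROUP BY employee
) dept_counts
WHERE total_dept >= 3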

EDIT1:

SELECT DISTINCT
 full_data.employee,
 full_data.sub_department,
 total_sub_dept_count.total_sub_dept,
 SUBSTRING(full_data.sub_department,0,2) AS Department,
 total_dept_count.total_dept
FROM
(
 -- distinct departments per employee
 SELECT
  employee,
  COUNT(Department) AS total_dept
 FROM
 (
  SELECT
    employee,
    Sub_department,
    SUBSTRING(Sub_department,0,2) AS Department,
    ROW_NUMBER() OVER (PARTITION BY employee, SUBSTRING(Sub_department,0,2) ORDER BY Sub_department) AS redundancy
  FROM
  employee_table
 ) dept_rows
 WHERE redundancy = 1
 GROUP BY employee
) total_dept_count
JOIN
(
 -- distinct sub-departments per employee
 SELECT
  employee,
  COUNT(sub_department) AS total_sub_dept
 FROM
 (
  SELECT
    employee,
    sub_department,
    ROW_NUMBER() OVER (PARTITION BY employee, sub_department ORDER BY sub_department) AS redundancy
  FROM
  employee_table
 ) sub_dept_rows
 WHERE redundancy = 1
 GROUP BY employee
) total_sub_dept_count
ON (total_dept_count.employee = total_sub_dept_count.employee)
JOIN
 employee_table full_data
ON (total_sub_dept_count.employee = full_data.employee)

Answer 1: (score: 0)

You can use the window function collect_set() to get the result.
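A minimal sketch of what a collect_set() approach could look like, not the answerer's own code: collect_set() gathers the distinct values of a column over a window, and size() turns that set into a count. The table name employee_table is an assumption, and the column aliases follow the output layout requested in the question.

```sql
-- Sketch only: employee_table is an assumed table name.
SELECT DISTINCT                      -- DISTINCT drops the duplicated input rows
  employee,
  sub_department,
  -- distinct sub-departments of this employee
  SIZE(COLLECT_SET(sub_department) OVER (PARTITION BY employee)) AS total_sub_dept,
  SUBSTRING(sub_department, 0, 2) AS department,
  -- distinct departments of this employee
  SIZE(COLLECT_SET(SUBSTRING(sub_department, 0, 2)) OVER (PARTITION BY employee)) AS total_dept
FROM employee_table
```

This keeps every (employee, sub_department) row and attaches both counts without a join. The set an employee belongs to can then be derived by filtering on total_sub_dept and total_dept in an outer query.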