I'm new to Spark and have a question about Spark SQL. Consider this EMPLOYEE table:
Employee Sub_department Department
A 105182 10
A 105182 10 (data can be redundant !)
A 114256 11
A 127855 12
A 125182 12
B 136234 13
B 133468 13
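Department is defined as substring(sub_department,0,2), i.e. only the first two digits of sub_department.
What I want to display is the employees divided into 3 types:
I don't know how to do this even in classic SQL. But at least, I think the final output could look like this: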
Employee Sub_department total_sub_dept Department total_dept
A 105182 4 10 3
A 114256 4 11 3
A 127855 4 12 3
A 125182 4 12 3
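A final column named "Set" would show which set the employee belongs to, but it is optional, and I'm afraid computing such a value would be too expensive...
For both columns (sub_department and Department), it is important to show the distinct values and their counts.
I have a very large table (many columns, and a lot of data that can be redundant), so my idea was to do a first partition on sub_department and store the result in a first table, then do a second partition on Department (regardless of the sub_department value) and store it in a second table, and finally do an inner join between the two tables on the employee name.
But I got some wrong results, and I don't know whether there is a better way to do this, or at least a way to optimize it, since the Department column depends on sub_department (one group by instead of two).
So, how can I fix this? I tried, but it seems impossible to combine count(column) on the same column for each of the 2 columns...
Thanks in advance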
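Answer 0 (score: 0)
Salman, people on SO will downvote if you don't post the code you have tried so far. I'll help you with the requirement for set 1, just to encourage you. Try to understand the query below; once you have, sets 2 and 3 are straightforward.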
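SELECT
  employee,
  total_dept
FROM
(
  SELECT
    employee,
    COUNT(Department) AS total_dept
  FROM
  (
    SELECT
      employee,
      Sub_department,
      SUBSTRING(Sub_department, 0, 2) AS Department,
      -- Spark requires an ORDER BY inside the window for ROW_NUMBER()
      ROW_NUMBER() OVER (PARTITION BY employee, SUBSTRING(Sub_department, 0, 2) ORDER BY Sub_department) AS redundancy
    FROM
      employee_table
  ) deduplicated
  WHERE redundancy = 1  -- keep one row per (employee, Department)
  GROUP BY employee
) dept_counts
WHERE total_dept >= 3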
EDIT1:
SELECT DISTINCT  -- DISTINCT drops the redundant input rows from the final output
  full_data.employee,
  full_data.sub_department,
  total_sub_dept_count.total_sub_dept,
  SUBSTRING(full_data.sub_department, 0, 2) AS Department,
  total_dept_count.total_dept
FROM
(
  -- number of distinct departments per employee
  SELECT
    employee,
    COUNT(Department) AS total_dept
  FROM
  (
    SELECT
      employee,
      Sub_department,
      SUBSTRING(Sub_department, 0, 2) AS Department,
      ROW_NUMBER() OVER (PARTITION BY employee, SUBSTRING(Sub_department, 0, 2) ORDER BY Sub_department) AS redundancy
    FROM
      employee_table
  ) dept_dedup
  WHERE redundancy = 1
  GROUP BY employee
) total_dept_count
JOIN
(
  -- number of distinct sub-departments per employee
  SELECT
    employee,
    COUNT(sub_department) AS total_sub_dept
  FROM
  (
    SELECT
      employee,
      sub_department,
      ROW_NUMBER() OVER (PARTITION BY employee, sub_department ORDER BY sub_department) AS redundancy
    FROM
      employee_table
  ) sub_dept_dedup
  WHERE redundancy = 1
  GROUP BY employee
) total_sub_dept_count
ON (total_dept_count.employee = total_sub_dept_count.employee)
JOIN
  employee_table full_data
ON (total_sub_dept_count.employee = full_data.employee)
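Since Department is derived from sub_department, both counts can probably come from a single GROUP BY with COUNT(DISTINCT ...), which is the "one group by instead of two" the question asks about. Below is a minimal sketch of that idea, not a tested Spark job: it runs the query against an in-memory SQLite database (a stand-in chosen only so the example is self-contained), assumes sub_department is stored as text, and uses the 1-based SUBSTR(x, 1, 2). The table name employee_table comes from the query above; `ROWS`, `QUERY`, and `run_query` are hypothetical names.

```python
import sqlite3

# Sample rows from the question; sub_department is stored as text (assumption)
# so that SUBSTR can take its first two characters without a cast.
ROWS = [
    ("A", "105182"), ("A", "105182"),  # redundant row on purpose
    ("A", "114256"), ("A", "127855"), ("A", "125182"),
    ("B", "136234"), ("B", "133468"),
]

QUERY = """
SELECT DISTINCT
    e.employee,
    e.sub_department,
    c.total_sub_dept,
    SUBSTR(e.sub_department, 1, 2) AS department,
    c.total_dept
FROM employee_table e
JOIN (
    -- one grouping computes both distinct counts per employee
    SELECT
        employee,
        COUNT(DISTINCT sub_department) AS total_sub_dept,
        COUNT(DISTINCT SUBSTR(sub_department, 1, 2)) AS total_dept
    FROM employee_table
    GROUP BY employee
) c ON c.employee = e.employee
ORDER BY e.employee, e.sub_department
"""


def run_query():
    """Load the sample rows into SQLite and return the query result."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE employee_table (employee TEXT, sub_department TEXT)"
    )
    conn.executemany("INSERT INTO employee_table VALUES (?, ?)", ROWS)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in run_query():
        print(row)
```

For employee A this gives total_sub_dept = 4 and total_dept = 3, matching the expected output in the question. A filter such as total_dept >= 3 can then be applied to the joined result.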
Answer 1 (score: 0)
You can use collect_set() as a window function to get the result.
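One way to read the collect_set() suggestion: collect_set keeps only distinct values, so in Spark SQL an expression along the lines of size(collect_set(Sub_department) OVER (PARTITION BY Employee)) should give the number of distinct sub-departments per employee while keeping every input row. The block below is not Spark code. It is a plain-Python model of what that windowed expression computes, using the sample rows from the question, with sub_department assumed to be text and `with_distinct_counts` a hypothetical helper name.

```python
from collections import defaultdict

# Sample rows from the question: (employee, sub_department as text).
ROWS = [
    ("A", "105182"), ("A", "105182"),  # redundant row on purpose
    ("A", "114256"), ("A", "127855"), ("A", "125182"),
    ("B", "136234"), ("B", "133468"),
]


def with_distinct_counts(rows):
    """Annotate each row with per-employee distinct counts.

    Models size(collect_set(col)) over a window partitioned by employee:
    a set per employee plays the role of collect_set, len() the role of size.
    """
    sub_depts = defaultdict(set)
    depts = defaultdict(set)
    for employee, sub_department in rows:
        sub_depts[employee].add(sub_department)
        depts[employee].add(sub_department[:2])  # first two digits = department

    # Like a window function, every input row is kept and annotated.
    return [
        (emp, sub, len(sub_depts[emp]), sub[:2], len(depts[emp]))
        for emp, sub in rows
    ]


if __name__ == "__main__":
    for row in with_distinct_counts(ROWS):
        print(row)
```

Unlike a GROUP BY, this keeps the redundant rows, so a deduplication step would still be needed to match the expected output in the question.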