How to rewrite COUNT DISTINCT?

Date: 2019-10-18 14:34:22

Tags: sql hive

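Because count and distinct are used together in one query, Hive runs it with only a single reducer, which is a problem. How can I rewrite the select to avoid this? Is it possible with window functions?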

 select
  a.second_id,
  if(a.proc_id = 'CONST1' and bb.third_id is not null,
     count(distinct bb.first_id),
     '') as qty
from a a
join (select
        b.first_id,
        b.second_id,
        b.third_id
      from b b) bb
     on bb.second_id = a.second_id
group by
  a.second_id,
  a.proc_id,
  bb.third_id;

1 answer:

Answer 0 (score: 1)

This is your query:

select a.second_id,
       (case when a.proc_id = 'CONST1' and bb.third_id is not null
             then count(distinct bb.first_id)
        end) as qty
from a join
     (select b.first_id, b.second_id, b.third_id
      from b
     ) bb
     on bb.second_id = a.second_id
group by a.second_id, a.proc_id, bb.third_id;

In practice, count(distinct) can be handled in a subquery using group by and window functions. I see no value in not aggregating first, so:

select a.second_id,
       (case when a.proc_id = 'CONST1' and bb.third_id is not null
             then max(bb.num_firsts)
        end) as qty
from a join
     (select b.second_id, b.third_id,
             count(distinct first_id) as num_firsts
      from b
      group by b.second_id, b.third_id
     ) bb
     on bb.second_id = a.second_id
group by a.second_id, a.proc_id, bb.third_id;

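You are aggregating by second_id and third_id in the outer query, and the subquery returns one row per (second_id, third_id) pair, so each outer group contains only one row from the aggregated subquery. The version above uses max(bb.num_firsts), but you could also add num_firsts to the outer group by.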
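
As a sketch of that second option, the query below adds bb.num_firsts to the outer group by so it can be selected without max(). It reuses the table and column names from the question; it has not been run against the asker's data, and it assumes Hive accepts a case expression built only from grouping columns.

```sql
-- Variant of the query above: bb.num_firsts is a grouping column,
-- so it is selected directly instead of being wrapped in max().
select a.second_id,
       (case when a.proc_id = 'CONST1' and bb.third_id is not null
             then bb.num_firsts
        end) as qty
from a join
     (select b.second_id, b.third_id,
             count(distinct b.first_id) as num_firsts
      from b
      group by b.second_id, b.third_id
     ) bb
     on bb.second_id = a.second_id
group by a.second_id, a.proc_id, bb.third_id, bb.num_firsts;
```

Because num_firsts is already constant within each (second_id, third_id) group, adding it to the group by should not change the number of output rows.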

That still might not fix your problem, but this query is easier to modify. As I recall, the best approach in Hive is a select distinct subquery:

select a.second_id,
       (case when a.proc_id = 'CONST1' and bb.third_id is not null
             then max(bb.num_firsts)
        end) as qty
from a join
     (select b.second_id, b.third_id,
             count(*) as num_firsts
      from (select distinct second_id, third_id, first_id
            from b
           ) b
      group by b.second_id, b.third_id
     ) bb
     on bb.second_id = a.second_id
group by a.second_id, a.proc_id, bb.third_id;

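This is equivalent if first_id is never null. If first_id can be null, the null is counted as a separate value here, whereas count(distinct) ignores it; just filter the nulls out if you don't want that.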
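
A minimal sketch of that filter, assuming the same tables and columns as in the question: the only change from the previous query is the where clause inside the select distinct subquery.

```sql
select a.second_id,
       (case when a.proc_id = 'CONST1' and bb.third_id is not null
             then max(bb.num_firsts)
        end) as qty
from a join
     (select b.second_id, b.third_id,
             count(*) as num_firsts
      from (select distinct second_id, third_id, first_id
            from b
            where first_id is not null  -- drop nulls so they are not counted as a value
           ) b
      group by b.second_id, b.third_id
     ) bb
     on bb.second_id = a.second_id
group by a.second_id, a.proc_id, bb.third_id;
```

One difference to be aware of: a (second_id, third_id) pair whose first_id values are all null disappears from bb entirely with this filter, while count(distinct first_id) would have returned 0 for it. With the inner join, such pairs then produce no output row at all, so check whether that matters for your data.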