Question

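Because the query uses count together with distinct, Hive runs it with only a single reducer, which is a problem. How can the select be rewritten to avoid this? Would window functions be an option?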

 select
  a.second_id,
  if(a.proc_id = 'CONST1' and bb.third_id is not null,
     count(distinct bb.first_id),
     '') as qty
from a          a
join (select
        b.first_id,
        b.second_id,
        b.third_id
      from b b) bb
     on bb.second_id = a.second_id
group by
  a.second_id,
  a.proc_id,
  bb.third_id;

Answer 1

This is your query:

select a.second_id,
       (case when a.proc_id = 'CONST1' and bb.third_id is not null
             then count(distinct bb.first_id)
        end) as qty
from a join
     (select b.first_id, b.second_id, b.third_id
      from b
     ) bb
     on bb.second_id = a.second_id
group by a.second_id, a.proc_id, bb.third_id;

In practice, the count(distinct) can be handled in a subquery using group by and window functions. I don't see any value in not aggregating first, so:

select a.second_id,
       (case when a.proc_id = 'CONST1' and bb.third_id is not null
             then max(bb.num_firsts)
        end) as qty
from a join
     (select b.second_id, b.third_id,
             count(distinct first_id) as num_firsts
      from b
      group by b.second_id, b.third_id
     ) bb
     on bb.second_id = a.second_id
group by a.second_id, a.proc_id, bb.third_id;

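You are aggregating by second_id and third_id in the outer query, so each outer group sees only one row from the aggregated subquery. The version above uses max(bb.num_firsts), but you could also include num_firsts in the outer group by.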
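
As a sketch of that second option, assuming the same tables and columns as above, num_firsts can be added to the outer group by so that no aggregate is needed in the select list. Because num_firsts is fully determined by (second_id, third_id), the extra grouping key should not change the result:

```sql
-- Variant of the query above: group by num_firsts instead of wrapping it in max().
-- num_firsts is constant within each (second_id, third_id) pair, so adding it
-- to the group by does not split any group.
select a.second_id,
       (case when a.proc_id = 'CONST1' and bb.third_id is not null
             then bb.num_firsts
        end) as qty
from a join
     (select b.second_id, b.third_id,
             count(distinct b.first_id) as num_firsts
      from b
      group by b.second_id, b.third_id
     ) bb
     on bb.second_id = a.second_id
group by a.second_id, a.proc_id, bb.third_id, bb.num_firsts;
```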

That still might not fix your problem, but this query is easier to modify. As I recall, the best approach in Hive is a select distinct subquery:

select a.second_id,
       (case when a.proc_id = 'CONST1' and bb.third_id is not null
             then max(bb.num_firsts)
        end) as qty
from a join
     (select b.second_id, b.third_id,
             count(*) as num_firsts
      from (select distinct second_id, third_id, first_id
            from b
           ) b
      group by b.second_id, b.third_id
     ) bb
     on bb.second_id = a.second_id
group by a.second_id, a.proc_id, bb.third_id;

This is the same thing if first_id is never NULL. Otherwise, NULL is counted as a separate value; if you don't want that, just filter those rows out.
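
One way to handle the NULL case, sketched here under the assumption that the result should match count(distinct first_id) exactly, is to count first_id instead of counting rows. count(first_id) skips NULLs, and a (second_id, third_id) pair whose first_id values are all NULL still appears with a count of 0. Adding "where first_id is not null" to the innermost subquery would also ignore NULLs, but it would drop such pairs from the join entirely.

```sql
-- Same shape as the select distinct version above, but NULL first_id values
-- are not counted: count(b.first_id) ignores NULLs, whereas count(*) counts
-- the single NULL row that select distinct leaves per group.
select a.second_id,
       (case when a.proc_id = 'CONST1' and bb.third_id is not null
             then max(bb.num_firsts)
        end) as qty
from a join
     (select b.second_id, b.third_id,
             count(b.first_id) as num_firsts
      from (select distinct second_id, third_id, first_id
            from b
           ) b
      group by b.second_id, b.third_id
     ) bb
     on bb.second_id = a.second_id
group by a.second_id, a.proc_id, bb.third_id;
```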
