Question

I have the following table:

custID  Cat
   1    A
   1    B
   1    B
   1    B
   1    C
   2    A
   2    A
   2    C
   3    B
   3    C
   4    A
   4    C
   4    C
   4    C

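I need to aggregate by custID, as efficiently as possible, to get each customer's most frequent category (Cat), second most frequent category, and third most frequent category. For the table above, the output should be: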

    most freq   2nd most freq   3rd most freq
1       B             A              C
2       A             C             Null
3       B             C             Null
4       C             A             Null

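When counts are tied, I don't care which category comes first and which comes second. For example, for customer 1 the second and third most frequent categories could be swapped, because each of them occurs only once.

Any SQL dialect is fine, preferably Hive SQL.

Thanks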

Answer 1

Try grouping twice and using dense_rank() to rank the per-category counts. I'm not 100% sure, but I think it should work in Hive as well.

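-- Outer query: pivot the ranks into one column per position.
select custId,
    max(case when t.rn = 1 then cat end) as [most freq],
    max(case when t.rn = 2 then cat end) as [2nd most freq],
    max(case when t.rn = 3 then cat end) as [3rd most freq]
from
(
  -- Inner query: count rows per (custId, cat) and rank the counts per customer.
  -- Note: dense_rank() gives tied counts the same rank.
  select custId, cat, dense_rank() over (partition by custId order by count(*) desc) rn
  from your_table
  group by custId, cat
) t
group by custId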

Following the comments, I've added a slightly modified solution that is compatible with Hive SQL:

select custId,
    max(case when t.rn = 1 then cat else null end) as most_freq,
    max(case when t.rn = 2 then cat else null end) as 2nd_most_freq,
    max(case when t.rn = 3 then cat else null end) as 3rd_most_freq
from
(
  -- Rank the precomputed counts within each customer.
  select custId, cat, dense_rank() over (partition by custId order by ct desc) rn
  from (
    -- Count occurrences of each category per customer.
    select custId, cat, count(*) ct
    from your_table
    group by custId, cat
  ) your_table_with_counts
) t
group by custId

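One caveat worth checking against the sample data: dense_rank() assigns tied counts the same rank, so for customer 1 both A and C land on rank 2 and nothing lands on rank 3. The following is a minimal sketch, not part of the original answer, that compares dense_rank() with row_number() plus a tie-breaker on cat (acceptable here only because the question says tie order does not matter). It runs the query through Python's built-in sqlite3 module purely for illustration and assumes a SQLite build of 3.25 or later (window function support). The aliases second_most_freq and third_most_freq and the helper top_categories are hypothetical names chosen for this example.

```python
import sqlite3

# Sample rows copied from the question: (custID, Cat).
ROWS = [
    (1, "A"), (1, "B"), (1, "B"), (1, "B"), (1, "C"),
    (2, "A"), (2, "A"), (2, "C"),
    (3, "B"), (3, "C"),
    (4, "A"), (4, "C"), (4, "C"), (4, "C"),
]

# Same shape as the Hive-compatible query; {rank_expr} is the window
# expression under comparison.
QUERY = """
select custId,
    max(case when t.rn = 1 then cat end) as most_freq,
    max(case when t.rn = 2 then cat end) as second_most_freq,
    max(case when t.rn = 3 then cat end) as third_most_freq
from
(
  select custId, cat, {rank_expr} rn
  from (
    select custId, cat, count(*) ct
    from your_table
    group by custId, cat
  ) your_table_with_counts
) t
group by custId
order by custId
"""


def top_categories(rank_expr):
    """Load the sample rows into an in-memory table and run QUERY."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table your_table (custId integer, cat text)")
    conn.executemany("insert into your_table values (?, ?)", ROWS)
    result = conn.execute(QUERY.format(rank_expr=rank_expr)).fetchall()
    conn.close()
    return result


# Ties share a rank: customer 1 comes back as (1, 'B', 'C', None).
dense = top_categories(
    "dense_rank() over (partition by custId order by ct desc)"
)

# Every category gets its own rank: customer 1 comes back as (1, 'B', 'A', 'C').
numbered = top_categories(
    "row_number() over (partition by custId order by ct desc, cat)"
)

for row in numbered:
    print(row)
```

With row_number(), the result for the sample data matches the output requested in the question. Hive also provides row_number() as a windowing function, so the same substitution should carry over to the Hive query, though that has not been verified here.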

Answer 2

SELECT journal, count(*) as frequency
FROM ${hiveconf:TNHIVE}
WHERE journal IS NOT NULL
GROUP BY journal
ORDER BY frequency DESC
LIMIT 5;
