Select the top N most frequently purchased items in each category

Time: 2018-12-25 15:54:15

Tags: sql hive hiveql

I'm using HiveQL and I need to select the 10 most frequently purchased items in each category. I assume the same problem is easy to solve in regular SQL.

Is there a faster way than the snippet below? I just don't understand how to use so-called window functions here...

SELECT item, COUNT(item) AS freq FROM mytable WHERE category='category1' GROUP BY item ORDER BY freq DESC LIMIT 1
UNION ALL SELECT item, COUNT(item) AS freq FROM mytable WHERE category='category2' GROUP BY item ORDER BY freq DESC LIMIT 1
UNION ALL SELECT item, COUNT(item) AS freq FROM mytable WHERE category='category3' GROUP BY item ORDER BY freq DESC LIMIT 1
UNION ALL SELECT item, COUNT(item) AS freq FROM mytable WHERE category='category4' GROUP BY item ORDER BY freq DESC LIMIT 1
...

Sample table data:

item1 category1
item2 category1
item2 category1
item5 category2
item5 category2
item4 category3
item2 category4

The result should be:

item2 category1
item5 category2
item4 category3
item2 category4

1 answer:

Answer 0: (score: 2)

Use row_number() together with GROUP BY:

SELECT category, item, freq
FROM (SELECT category, item, COUNT(*) AS freq,
             ROW_NUMBER() OVER (PARTITION BY category ORDER BY COUNT(*) DESC) as seqnum
      FROM mytable 
      GROUP BY category, item
     ) ci
WHERE seqnum = 1;

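This returns exactly one row per category, even when several items tie for the most frequent. If you want every tied item returned, use rank() instead of row_number().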
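
To get the top 10 from the question rather than just the top 1, the same query can probably be adapted by loosening the filter. The sketch below also swaps in rank() so that ties are kept. It assumes the question's mytable has one row per purchase with columns item and category; the alias rnk is made up here for illustration.

```sql
-- Top 10 items per category, keeping ties at the cutoff.
-- Assumes mytable(item, category) holds one row per purchase.
SELECT category, item, freq
FROM (SELECT category, item, COUNT(*) AS freq,
             RANK() OVER (PARTITION BY category ORDER BY COUNT(*) DESC) AS rnk
      FROM mytable
      GROUP BY category, item
     ) ci
WHERE rnk <= 10;  -- use rnk = 1 for only the most frequent item(s)
```

With rank(), a category can return more than 10 rows when several items share the same count at the cutoff. With row_number() it returns at most 10, and ties are broken arbitrarily.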