Hive SQL: finding the most common value in multiple columns

Time: 2018-06-29 08:54:31

Tags: hive hiveql

I have the following data:

name device operating browser
 A     mob      l       c
 A     mob      l       b
 A     mob      l       b
 A     web      w       b
 B     web      w       c
 B     web      w       c
 B     mob      w       c
 B     web      l       b

For each name, I want to find the most common value in each column, so the result would look like this:

name device operating browser
 A     mob      l       b
 B     web      w       c

How can I achieve this? Thanks!

2 answers:

Answer 0: (score: 0)

This may help. Note, however, that relying on correlated subqueries like these is not ideal.

SELECT
a.name,
(SELECT b.device FROM YOUR_TABLE_NAME b WHERE b.name = a.name GROUP BY device ORDER BY COUNT(b.device) DESC LIMIT 1) AS device,
(SELECT c.operating FROM YOUR_TABLE_NAME c WHERE c.name = a.name GROUP BY operating ORDER BY COUNT(c.operating) DESC LIMIT 1) AS operating,
(SELECT d.browser FROM YOUR_TABLE_NAME d WHERE d.name = a.name GROUP BY browser ORDER BY COUNT(d.browser) DESC LIMIT 1) AS browser
FROM YOUR_TABLE_NAME AS a
GROUP BY a.name
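
Because correlated subqueries in the select list can be slow and may not be accepted by every Hive version, the same per-column result could also be sketched with one aggregation per column, joined on name. This is only a sketch under a few assumptions: the window function row_number() is available (Hive 0.11+), YOUR_TABLE_NAME is the same placeholder as above, and ties are broken alphabetically, which is an arbitrary choice made here only so that each name yields exactly one row.

```sql
-- Sketch only: YOUR_TABLE_NAME is the placeholder from the query above.
-- Each derived table ranks one column's values per name by frequency.
SELECT d.name, d.device, o.operating, b.browser
FROM (
  SELECT name, device,
         row_number() OVER (PARTITION BY name ORDER BY cnt DESC, device) AS rn
  FROM (
    SELECT name, device, COUNT(*) AS cnt
    FROM YOUR_TABLE_NAME
    GROUP BY name, device
  ) dc
) d
JOIN (
  SELECT name, operating,
         row_number() OVER (PARTITION BY name ORDER BY cnt DESC, operating) AS rn
  FROM (
    SELECT name, operating, COUNT(*) AS cnt
    FROM YOUR_TABLE_NAME
    GROUP BY name, operating
  ) oc
) o ON o.name = d.name
JOIN (
  SELECT name, browser,
         row_number() OVER (PARTITION BY name ORDER BY cnt DESC, browser) AS rn
  FROM (
    SELECT name, browser, COUNT(*) AS cnt
    FROM YOUR_TABLE_NAME
    GROUP BY name, browser
  ) bc
) b ON b.name = d.name
-- Keep only the most frequent value of each column (rank 1 per name).
WHERE d.rn = 1 AND o.rn = 1 AND b.rn = 1;
```

On the sample data in the question this gives A: mob, l, b and B: web, w, c, since each of those values occurs three times for its name and the alternative only once.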

Answer 1: (score: 0)

For Hive 0.11+, you can use a window function such as rank:

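select name, device, operating, browser
from (
  -- rank each combination within its name, highest count first
  select *, rank() over (partition by name order by cnt desc) as rnk
  from (
    -- count how often each (name, device, operating, browser) row occurs
    select name, device, operating, browser, count(*) as cnt
    from yourtable
    group by name, device, operating, browser
  ) counted
) ranked
-- keep only the top-ranked combination(s) per name
where rnk = 1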

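Step by step:

  1. Count the occurrences of each identical row (the same name, device, operating and browser values)
  2. Rank those counts within each name, highest first
  3. Keep only the rows with the highest count

Note: if several combinations tie for the highest count within a name, the query returns all of those rows.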
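
If exactly one row per name is wanted even when counts tie, one possible variation is to swap rank() for row_number() and add tie-breaking columns to the ordering. The sketch below assumes the same yourtable placeholder; ordering ties by device, operating and browser is an arbitrary choice for illustration, not something the original answer prescribes.

```sql
-- Variation on the query above: row_number() assigns distinct numbers,
-- so ties on cnt are resolved by the extra ORDER BY columns.
select name, device, operating, browser
from (
  select counted.*,
         row_number() over (
           partition by name
           order by cnt desc, device, operating, browser
         ) as rn
  from (
    select name, device, operating, browser, count(*) as cnt
    from yourtable
    group by name, device, operating, browser
  ) counted
) numbered
where rn = 1
```

Note that this still picks the most frequent combination of values per name. That matches the expected output for the sample data, but it can differ from the most frequent value of each column taken separately.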