Question

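I'm working with Hive. I want to do feature engineering on this table: for each user, select the two most frequent values in the user_agent column and put them in a single row, to summarize the information.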

I have a table that looks like this:

userID | user_agent 
1      |  Windows NT 6.1
1      |  Windows NT 6.1
1      |  Windows NT 6.1
1      |  Macintosh
1      |  Macintosh
2      |  Windows NT 6.1
2      |  Windows NT 6.1
2      |  Macintosh
2      |  X11
3      |  X11
3      |  X11
4      |  Windows NT 6.1
4      |  X11
5      |  iPhone
6      |  X11
6      |  iPhone
7      |  
7      |  
7      |  
7      |  Windows NT 6.1

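Note that the real user_agent values are much more complex than in this sample table and have a large number of unique values, so I can't use dummy variables. (I have already tried that.)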

Let's call the column with the most frequent value top_1_user_agent and the column with the second most frequent value top_2_user_agent.

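When a user has only one distinct value, top_2_user_agent must be null, as for userID 3. When there is a tie, as for userID 2 and userID 6, the selected value must be the one that appears first in the table.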

The result must look like this:

userID | top_1_user_agent |   top_2_user_agent 
1      |  Windows NT 6.1  | Macintosh
2      |  Windows NT 6.1  | Macintosh
3      |  X11             | 
4      |  Windows NT 6.1  | X11
5      |  iPhone          | 
6      |  X11             | iPhone    
7      |                  | Windows NT 6.1

Any help is welcome. Thanks!

Answer 1

rank() and collect_set() should do it.

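select userID, collect_set(user_agent) as top_user_agents
from
(
    -- rank each user's agents by frequency; partition by userID only,
    -- otherwise every (userID, user_agent) pair would get rank 1
    select *, rank() over (partition by userID order by cnt desc) as rnk
    from
    (
        -- count how often each user_agent occurs per user
        select userID, user_agent, count(*) as cnt
        from yourtable
        group by userID, user_agent
    ) x
) y
where rnk <= 2
group by userID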
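
Note that the query above returns an array per user, not the two separate columns asked for. collect_set() does not guarantee element order, and rank() gives tied values the same rank, so more than two values can pass the filter. A possible variant, offered as a sketch rather than a tested solution, uses row_number() with conditional aggregation. It assumes the table has a column that records row order; the name event_ts below is hypothetical, since Hive tables have no inherent row order to break ties with.

```sql
-- Sketch only. `yourtable` is the table name used in the answer above;
-- `event_ts` is a hypothetical column that records the original row order.
select userID,
       -- a case without else yields NULL, and max() ignores NULLs,
       -- so top_2_user_agent is NULL when a user has only one distinct value
       max(case when rn = 1 then user_agent end) as top_1_user_agent,
       max(case when rn = 2 then user_agent end) as top_2_user_agent
from
(
    -- row_number() never repeats within a partition, so at most one row
    -- per user gets rn = 1 and at most one gets rn = 2
    select userID, user_agent,
           row_number() over (partition by userID
                              order by cnt desc, first_seen asc) as rn
    from
    (
        -- frequency of each user_agent per user, plus its first appearance
        select userID, user_agent,
               count(*)      as cnt,
               min(event_ts) as first_seen
        from yourtable
        group by userID, user_agent
    ) x
) y
where rn <= 2
group by userID;
```

If no ordering column exists, the tie-break has to use something else, for example the user_agent value itself, and "first in the table" cannot be reproduced reliably.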
