我有一个包含几百万行数据的表,如下所示:
+---------------+--------------+-------------------+
| page | search_term | interactions |
+---------------+--------------+-------------------+
| /mom | pizza | 15 |
| /dad | pizza | 8 |
| /uncle | pizza | 2 |
| /brother | pizza | 7 |
| /mom | pasta | 12 |
| /dad | pasta | 23 |
+---------------+--------------+-------------------+
我的目标是运行一个HiveQL查询,该查询将返回最大的'互动'每个唯一页面/术语组合的编号。例如:
+---------------+--------------+-------------------+
| page | search_term | interactions |
+---------------+--------------+-------------------+
| /dad | pasta | 23 |
| /mom | pizza | 15 |
+---------------+--------------+-------------------+
考虑到每个唯一页面都有成千上万的search_terms,我怎么写这个呢,但我只想把一个search_term拉到最多的交互?
我尝试过使用max(interaction)和max(struct(interact,search_term))。col1但是没有运气。无论有多少交互,我的输出始终为每个页面提供所有search_terms。
谢谢!
答案 0 :(得分:0)
使用row_number()分析函数:
select page, search_term, interactions
from
(select page, search_term, interactions,
row_number() over (partition by page order by interactions desc ) rn
)s
where rn = 1;