I am trying to run this query in Hive to return only the 10 URLs that appear most often in the adimpression table.
select
  ranked_mytable.url,
  ranked_mytable.cnt
from
  ( select iq.url, iq.cnt, rank() over (partition by iq.url order by iq.cnt desc) rnk
    from
      ( select url, count(*) cnt
        from store.adimpression ai
        inner join zuppa.adgroupcreativesubscription agcs
          on agcs.id = ai.adgroupcreativesubscriptionid
        inner join zuppa.adgroup ag
          on ag.id = agcs.adgroupid
        where ai.datehour >= '2014-05-15 00:00:00'
          and ag.siteid = 1240
        group by url
      ) iq
  ) ranked_mytable
where
  ranked_mytable.rnk <= 10
order by
  ranked_mytable.url,
  ranked_mytable.rnk desc
;
Unfortunately, I get an error message:
FAILED: SemanticException [Error 10002]: Line 26:23 Invalid column reference 'rnk'
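I tried debugging it, and everything works up to the ranked_mytable subquery. I also tried commenting out the where ranked_mytable.rnk <= 10 clause, but the error message still appears.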
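Answer 0: (score: 11)
Hive cannot order by a column that is not in the "output" of the select statement. To fix this, simply include that column in the selected columns: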
select
  ranked_mytable.url,
  ranked_mytable.cnt,
  ranked_mytable.rnk
from
  ( select iq.url, iq.cnt, rank() over (partition by iq.url order by iq.cnt desc) rnk
    from
      ( select url, count(*) cnt
        from store.adimpression ai
        inner join zuppa.adgroupcreativesubscription agcs
          on agcs.id = ai.adgroupcreativesubscriptionid
        inner join zuppa.adgroup ag
          on ag.id = agcs.adgroupid
        where ai.datehour >= '2014-05-15 00:00:00'
          and ag.siteid = 1240
        group by url
      ) iq
  ) ranked_mytable
where
  ranked_mytable.rnk <= 10
order by
  ranked_mytable.url,
  ranked_mytable.rnk desc
;
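If you don't want the 'rnk' column in the final output, I expect you could wrap the whole thing in another query that uses it as an inner query and selects only the 'url' and 'cnt' fields.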
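The wrapping suggested above might look like the following sketch. The alias `t` is introduced here purely for illustration and is not part of the original answer. The sketch also assumes the outer query keeps the row order produced by the inner order by, which Hive does not necessarily guarantee (some versions may drop an order by inside a subquery), so treat it as a starting point rather than a tested solution.

```sql
-- Outer query exposes only url and cnt; rnk stays internal to the subquery.
-- "t" is a hypothetical alias added for this sketch.
select
  t.url,
  t.cnt
from
  ( select
      ranked_mytable.url,
      ranked_mytable.cnt,
      ranked_mytable.rnk   -- still selected here so that order by can see it
    from
      ( select iq.url, iq.cnt, rank() over (partition by iq.url order by iq.cnt desc) rnk
        from
          ( select url, count(*) cnt
            from store.adimpression ai
            inner join zuppa.adgroupcreativesubscription agcs
              on agcs.id = ai.adgroupcreativesubscriptionid
            inner join zuppa.adgroup ag
              on ag.id = agcs.adgroupid
            where ai.datehour >= '2014-05-15 00:00:00'
              and ag.siteid = 1240
            group by url
          ) iq
      ) ranked_mytable
    where
      ranked_mytable.rnk <= 10
    order by
      ranked_mytable.url,
      ranked_mytable.rnk desc
  ) t
;
```

If the final ordering matters, it may be simpler to keep rnk in the output and ignore it downstream.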
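Answer 1: (score: 3)
RANK OVER is not the best function for achieving this. A better solution is to combine SORT BY and LIMIT. On its own, LIMIT picks rows from the table at random, but this can be avoided by using it together with the SORT BY clause. From the Apache Wiki: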
-- Top k queries. The following query returns the top 5 sales records wrt amount.
SET mapred.reduce.tasks = 1;
SELECT * FROM sales SORT BY amount DESC LIMIT 5;
The query can be rewritten this way:
select
  iq.url,
  iq.cnt
from
  ( select url, count(*) cnt
    from store.adimpression ai
    inner join zuppa.adgroupcreativesubscription agcs
      on agcs.id = ai.adgroupcreativesubscriptionid
    inner join zuppa.adgroup ag
      on ag.id = agcs.adgroupid
    where ai.datehour >= '2014-05-15 00:00:00'
      and ag.siteid = 1240
    group by url ) iq
sort by
  iq.cnt desc
limit
  10
;
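Note that the wiki snippet sets a single reducer, which the rewritten query does not. As a hedged variation (not something proposed in this answer), Hive's order by produces a total ordering over the whole result rather than a per-reducer ordering, so the same top-10 query could be sketched with order by in place of sort by. On large results the total sort may be slower.

```sql
-- Variation on the rewritten query: order by gives a total ordering,
-- so the result does not depend on the number of reducers.
select
  iq.url,
  iq.cnt
from
  ( select url, count(*) cnt
    from store.adimpression ai
    inner join zuppa.adgroupcreativesubscription agcs
      on agcs.id = ai.adgroupcreativesubscriptionid
    inner join zuppa.adgroup ag
      on ag.id = agcs.adgroupid
    where ai.datehour >= '2014-05-15 00:00:00'
      and ag.siteid = 1240
    group by url ) iq
order by
  iq.cnt desc
limit
  10
;
```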
Answer 2: (score: 0)
Remove the partition by iq.url clause from rank() over (), then rerun the query.
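A sketch of what that change might look like, combined with the fix from the first answer (selecting rnk so that order by can reference it). The reasoning, as I read it: the inner query already groups by url, so each url occurs exactly once in iq, and partitioning by url would give every row rank 1. Ranking without a partition compares all URLs against each other. Ordering the final result by rnk alone is a choice made for this sketch, not part of the answer.

```sql
select
  ranked_mytable.url,
  ranked_mytable.cnt,
  ranked_mytable.rnk
from
  ( select iq.url, iq.cnt,
           -- no partition by: rank all URLs against each other by count
           rank() over (order by iq.cnt desc) rnk
    from
      ( select url, count(*) cnt
        from store.adimpression ai
        inner join zuppa.adgroupcreativesubscription agcs
          on agcs.id = ai.adgroupcreativesubscriptionid
        inner join zuppa.adgroup ag
          on ag.id = agcs.adgroupid
        where ai.datehour >= '2014-05-15 00:00:00'
          and ag.siteid = 1240
        group by url
      ) iq
  ) ranked_mytable
where
  -- rank() assigns equal ranks to ties, so this may return more than
  -- 10 rows when several URLs share the same count at the cutoff
  ranked_mytable.rnk <= 10
order by
  ranked_mytable.rnk
;
```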
Thanks! Kamleshkumar Gujarathi
Answer 3: (score: -1)
Put as before the rnk alias. It should work fine.