Question

I am trying to run this query in Hive to return only the top 10 URLs that appear most often in the adimpression table.

select
        ranked_mytable.url,
        ranked_mytable.cnt

from
        ( select iq.url, iq.cnt, rank() over (partition by iq.url order by iq.cnt desc) rnk
        from
                ( select url, count(*) cnt
                from store.adimpression ai
                        inner join zuppa.adgroupcreativesubscription agcs
                                on agcs.id = ai.adgroupcreativesubscriptionid
                        inner join zuppa.adgroup ag
                                on ag.id = agcs.adgroupid
                where ai.datehour >= '2014-05-15 00:00:00'
                        and ag.siteid = 1240
                group by url
                ) iq
        ) ranked_mytable

where
      ranked_mytable.rnk <= 10

order by
        ranked_mytable.url,
        ranked_mytable.rnk desc

;

Unfortunately, I get an error message:

FAILED: SemanticException [Error 10002]: Line 26:23 Invalid column reference 'rnk'

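I tried debugging it, and everything works fine up to the ranked_mytable subquery. I tried commenting out the where ranked_mytable.rnk <= 10 clause, but the error message keeps appearing.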

Answer 1

Hive cannot order by a column that is not in the "output" of the select statement. To fix this, simply include that column in the selected columns:

select
        ranked_mytable.url,
        ranked_mytable.cnt,
        ranked_mytable.rnk

from
        ( select iq.url, iq.cnt, rank() over (partition by iq.url order by iq.cnt desc) rnk
        from
                ( select url, count(*) cnt
                from store.adimpression ai
                        inner join zuppa.adgroupcreativesubscription agcs
                                on agcs.id = ai.adgroupcreativesubscriptionid
                        inner join zuppa.adgroup ag
                                on ag.id = agcs.adgroupid
                where ai.datehour >= '2014-05-15 00:00:00'
                        and ag.siteid = 1240
                group by url
                ) iq
        ) ranked_mytable

where
      ranked_mytable.rnk <= 10

order by
        ranked_mytable.url,
        ranked_mytable.rnk desc

;

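If you do not want the 'rnk' column in the final output, I expect you could wrap the whole thing in another inner query and select only the 'url' and 'cnt' fields.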
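
A minimal sketch of that wrapping, reusing the query above as a subquery. The alias final_mytable is hypothetical, and Hive does not formally guarantee that the ordering of a subquery survives the outer select, so treat the row order as an assumption to verify on your version:

```sql
-- The outer query keeps only url and cnt.
-- rnk stays in the inner select so that order by can reference it.
select
        final_mytable.url,
        final_mytable.cnt

from
        ( select
                ranked_mytable.url,
                ranked_mytable.cnt,
                ranked_mytable.rnk
        from
                ( select iq.url, iq.cnt, rank() over (partition by iq.url order by iq.cnt desc) rnk
                from
                        ( select url, count(*) cnt
                        from store.adimpression ai
                                inner join zuppa.adgroupcreativesubscription agcs
                                        on agcs.id = ai.adgroupcreativesubscriptionid
                                inner join zuppa.adgroup ag
                                        on ag.id = agcs.adgroupid
                        where ai.datehour >= '2014-05-15 00:00:00'
                                and ag.siteid = 1240
                        group by url
                        ) iq
                ) ranked_mytable
        where
              ranked_mytable.rnk <= 10
        order by
                ranked_mytable.url,
                ranked_mytable.rnk desc
        ) final_mytable

;
```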

Answer 2

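RANK OVER is not the best function for this. A better solution is to combine SORT BY and LIMIT. On its own, LIMIT picks rows from the table at random, but this is avoided when it is used together with SORT BY. From the Apache Wiki: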

-- Top k queries. The following query returns the top 5 sales records wrt amount.
SET mapred.reduce.tasks = 1;
SELECT * FROM sales SORT BY amount DESC LIMIT 5;

The query can be rewritten this way:

select
        iq.url,
        iq.cnt

from
        ( select url, count(*) cnt
        from store.adimpression ai
          inner join zuppa.adgroupcreativesubscription agcs
            on agcs.id = ai.adgroupcreativesubscriptionid
          inner join zuppa.adgroup ag
            on ag.id = agcs.adgroupid
        where ai.datehour >= '2014-05-15 00:00:00'
          and ag.siteid = 1240
        group by url ) iq

sort by
        iq.cnt desc

limit
        10

;
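
SORT BY orders rows within each reducer, which is why the wiki example pins the job to a single reducer. As a sketch of an alternative, assuming a total ordering is wanted without changing reducer settings, the same query can use ORDER BY, which Hive sorts globally through a single reducer:

```sql
-- Same top-10 query, with order by instead of sort by for a global ordering.
select
        iq.url,
        iq.cnt

from
        ( select url, count(*) cnt
        from store.adimpression ai
          inner join zuppa.adgroupcreativesubscription agcs
            on agcs.id = ai.adgroupcreativesubscriptionid
          inner join zuppa.adgroup ag
            on ag.id = agcs.adgroupid
        where ai.datehour >= '2014-05-15 00:00:00'
          and ag.siteid = 1240
        group by url ) iq

order by
        iq.cnt desc

limit
        10

;
```

Unlike rank(), LIMIT 10 returns at most 10 rows even when several URLs tie on cnt.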

Answer 3

Remove the partition by iq.url clause from rank() over (), then re-run the query.
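
The reason this matters: the inner query already groups by url, so each url appears once, every partition holds a single row, and every row gets rank 1. The rnk <= 10 filter then keeps all rows. A sketch of the query with the partition removed follows; keeping rnk in the select list and ordering by it ascending are my assumptions, not part of the original question:

```sql
-- rank() is now computed across all urls, so rnk 1 is the most frequent url.
select
        ranked_mytable.url,
        ranked_mytable.cnt,
        ranked_mytable.rnk

from
        ( select iq.url, iq.cnt, rank() over (order by iq.cnt desc) rnk
        from
                ( select url, count(*) cnt
                from store.adimpression ai
                        inner join zuppa.adgroupcreativesubscription agcs
                                on agcs.id = ai.adgroupcreativesubscriptionid
                        inner join zuppa.adgroup ag
                                on ag.id = agcs.adgroupid
                where ai.datehour >= '2014-05-15 00:00:00'
                        and ag.siteid = 1240
                group by url
                ) iq
        ) ranked_mytable

where
      ranked_mytable.rnk <= 10

order by
        ranked_mytable.rnk

;
```

Because rank() gives tied rows the same rank, this can return more than 10 rows when several URLs share a count.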

Thanks! Kamleshkumar Gujarathi

Answer 4

Put as before the rnk alias. It should work fine.
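
For illustration, this is the ranked subquery from the question with the alias written using an explicit as keyword. It only shows the syntax being suggested; the thread does not confirm whether it resolves the error on a given Hive version:

```sql
-- Same ranked subquery as in the question, with "as" before the rnk alias.
select iq.url, iq.cnt, rank() over (partition by iq.url order by iq.cnt desc) as rnk
from
        ( select url, count(*) cnt
        from store.adimpression ai
                inner join zuppa.adgroupcreativesubscription agcs
                        on agcs.id = ai.adgroupcreativesubscriptionid
                inner join zuppa.adgroup ag
                        on ag.id = agcs.adgroupid
        where ai.datehour >= '2014-05-15 00:00:00'
                and ag.siteid = 1240
        group by url
        ) iq
;
```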

Using the RANK OVER function in Hive
