Getting the top N rows per day in Hive - rank()

Date: 2017-02-21 20:54:55

Tags: sql hive rank

I have this table, with one row per sale:

 sale_date  salesman  sale_item_id
 20170102   JohnSmith       309
 20170102   JohnSmith       292
 20170103   AlexHam          93
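
For reference, a minimal Hive DDL matching this sample could look like the sketch below (the column types are assumptions; the question only shows the data and the table name salesforce.sales_data used in the query):

CREATE TABLE IF NOT EXISTS salesforce.sales_data (
  sale_date     STRING,   -- stored as yyyyMMdd text, e.g. '20170102'
  salesman      STRING,
  sale_item_id  INT
);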

I'm trying to get the top 20 salesmen per day, and I came up with this:

SELECT sale_date, salesman, sale_count, row_num
FROM (
  SELECT sale_date, salesman,
         count(*) as sale_count,
         rank() over (partition by sale_date order by sale_count desc) as row_num
  from salesforce.sales_data
) T
WHERE sale_date between  '20170101' and '20170110'
 and row_num <= 20

But I get this error:

FAILED: SemanticException Failed to breakup Windowing invocations into Groups. At least 1 group must only depend on input columns. Also check for circular dependencies.
Underlying error: org.apache.hadoop.hive.ql.parse.SemanticException: Line 5:35 Expression not in GROUP BY key 'sale_date'

I'm not sure where the group by should come into play. Can anyone help? Thanks!

2 answers:

Answer 0: (score: 3)

You are missing a group by in the subquery:

SELECT sale_date, salesman, sale_count, row_num
FROM (SELECT sale_date, salesman,
             count(*) as sale_count,
             rank() over (partition by sale_date order by count(*) desc) as row_num
      FROM salesforce.sales_data
      GROUP BY sale_date, salesman
     ) T
WHERE sale_date between '20170101' and '20170110' and row_num <= 20;

I think Hive will also accept the column alias in the order by, i.e. order by sale_count desc.
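
If so, the window in the subquery could order by the alias directly; a sketch (if your Hive version rejects the alias here, keep order by count(*) desc as above):

SELECT sale_date, salesman, sale_count, row_num
FROM (SELECT sale_date, salesman,
             count(*) as sale_count,
             rank() over (partition by sale_date order by sale_count desc) as row_num
      FROM salesforce.sales_data
      GROUP BY sale_date, salesman
     ) T
WHERE sale_date between '20170101' and '20170110' and row_num <= 20;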

Also note that with rank() you can get more or fewer than 20 rows per day if there are ties. If you want exactly 20 rows, you probably want row_number() instead.
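
For exactly 20 rows per day, the same query with row_number() (a sketch; ties are then broken arbitrarily unless you add a tie-breaker such as salesman to the order by):

SELECT sale_date, salesman, sale_count, row_num
FROM (SELECT sale_date, salesman,
             count(*) as sale_count,
             row_number() over (partition by sale_date order by count(*) desc) as row_num
      FROM salesforce.sales_data
      GROUP BY sale_date, salesman
     ) T
WHERE sale_date between '20170101' and '20170110' and row_num <= 20;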

Answer 1: (score: 0)

Try this:

SELECT sale_date, salesman, sale_count, row_num
FROM (
  SELECT sale_date, salesman, sale_count,
         rank() over (partition by sale_date order by sale_count desc) as row_num
  FROM (
    SELECT sale_date, salesman,
           count(*) over (partition by salesman) as sale_count
    FROM salesforce.sales_data
  ) t1
) t2
WHERE sale_date between '20170101' and '20170110'
  and row_num <= 20;

Edited and tested. Your problem is basically that you were trying to use the count before it had been computed for your over clause; if you compute the count in a subquery partitioned by salesman, that fixes the problem. You can't do a group by in the sales subquery, because if you do, you won't have access to sale_date.
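
One caveat: count(*) over (partition by salesman) counts a salesman's sales across all dates. Assuming the count is meant per salesman per day, a sketch of the same window-based idea, with the partition extended to sale_date and a DISTINCT to collapse the duplicated rows:

SELECT sale_date, salesman, sale_count, row_num
FROM (
  SELECT sale_date, salesman, sale_count,
         rank() over (partition by sale_date order by sale_count desc) as row_num
  FROM (
    -- one row per (sale_date, salesman) with its daily sale count
    SELECT DISTINCT sale_date, salesman,
           count(*) over (partition by sale_date, salesman) as sale_count
    FROM salesforce.sales_data
  ) t1
) t2
WHERE sale_date between '20170101' and '20170110'
  and row_num <= 20;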