SQL - 使用dense_rank()的窗口函数

时间:2017-11-14 03:19:57

标签: sql hive

我有一个数据集结构,例如下面存储在Hive中的数据集,称之为df:

+-----+-----+----------+--------+
| id1 | id2 |   date   | amount |
+-----+-----+----------+--------+
|   1 |   2 | 11-07-17 | 0.93   |
|   2 |   2 | 11-11-17 | 1.94   |
|   2 |   2 | 11-09-17 | 1.90   |
|   1 |   1 | 11-10-17 | 0.33   |
|   2 |   2 | 11-10-17 | 1.93   |
|   1 |   1 | 11-07-17 | 0.25   |
|   1 |   1 | 11-09-17 | 0.33   | 
|   1 |   1 | 11-12-17 | 0.33   |
|   2 |   2 | 11-08-17 | 1.90   |
|   1 |   1 | 11-08-17 | 0.30   |
|   2 |   2 | 11-12-17 | 2.01   |
|   1 |   2 | 11-12-17 | 1.00   |
|   1 |   2 | 11-09-17 | 0.94   |
|   2 |   2 | 11-07-17 | 1.94   |
|   1 |   2 | 11-11-17 | 1.92   |
|   1 |   1 | 11-11-17 | 0.33   |
|   1 |   2 | 11-10-17 | 1.92   |
|   1 |   2 | 11-08-17 | 0.94   |
+-----+-----+----------+--------+

我希望按id1和id2进行分区,然后按日期在id1和id2的每个分组中按降序排序,然后排名" amount"在那之内,相同的数量"连续几天将获得相同的排名。我希望看到的有序和排名输出显示在这里:

+-----+-----+------------+--------+------+
| id1 | id2 |    date    | amount | rank |
+-----+-----+------------+--------+------+
|   1 |   1 | 2017-11-12 | 0.33   |    1 |
|   1 |   1 | 2017-11-11 | 0.33   |    1 |
|   1 |   1 | 2017-11-10 | 0.33   |    1 |
|   1 |   1 | 2017-11-09 | 0.33   |    1 |
|   1 |   1 | 2017-11-08 | 0.30   |    2 |
|   1 |   1 | 2017-11-07 | 0.25   |    3 |
|   1 |   2 | 2017-11-12 | 1.00   |    1 |
|   1 |   2 | 2017-11-11 | 1.92   |    2 |
|   1 |   2 | 2017-11-10 | 1.92   |    2 |
|   1 |   2 | 2017-11-09 | 0.94   |    3 |
|   1 |   2 | 2017-11-08 | 0.94   |    3 |
|   1 |   2 | 2017-11-07 | 0.93   |    4 |
|   2 |   2 | 2017-11-12 | 2.01   |    1 |
|   2 |   2 | 2017-11-11 | 1.94   |    2 |
|   2 |   2 | 2017-11-10 | 1.93   |    3 |
|   2 |   2 | 2017-11-09 | 1.90   |    4 |
|   2 |   2 | 2017-11-08 | 1.90   |    4 |
|   2 |   2 | 2017-11-07 | 1.94   |    5 |
+-----+-----+------------+--------+------+

我尝试使用以下SQL查询:

SELECT 
    id1, 
    id2, 
    date, 
    amount,
    dense_rank() OVER (PARTITION BY id1, id2 ORDER BY date DESC) AS rank
FROM
    df
GROUP BY
    id1,
    id2,
    date,
    amount

但是,由于我没有收到我正在寻找的输出,因此该查询似乎没有按照我的意愿行事。

看起来像使用dense_rank的窗口函数,分区依据和order by是我需要的但是我似乎无法让它给我那些我想要的样本输出。任何帮助将非常感激!谢谢!

1 个答案:

答案 0 :(得分:2)

这非常棘手。我认为您需要使用lag()来查看值更改的位置,然后执行累计求和:

select df.*,
       sum(case when prev_amount = amount then 0 else 1 end) over
           (partition by id1, id2 order by date desc) as rank
from (select df.*,
             lag(amount) over (partition by id1, id2 order by date desc) as prev_amount
      from df
     ) df;