In Redshift, is there a way to filter for the records closest to a given value?

Time: 2017-06-21 19:11:04

Tags: sql amazon-redshift

I am using the PERCENT_RANK() function to get percentile measures for a given data set. Here is the query:

WITH time_values AS (
    SELECT
        var,
        (end_time - start_time) * 1.0 / 3600000000 AS num_hours,
        PERCENT_RANK() OVER (PARTITION BY var1 ORDER BY num_hours) AS pct_rank
    FROM table
    WHERE
        start_time >= 1493596800000000
        AND end_time < 1493683200000000
)
SELECT
    var,
    pct_rank,
    num_hours
FROM time_values
WHERE pct_rank IN (0.25, 0.5, 0.8, 0.99)
ORDER BY 1, 2;

However, because of how PERCENT_RANK() works, I don't get an exact match for every percentile I care about, so the output looks like this:

 var | pct_rank |    num_hours
-----+----------+------------------
   a |     0.25 |  31.752826672222
   a |      0.5 | 171.844016125555
   b |     0.25 | 230.704589953055
   b |      0.5 | 246.269648327222

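I'm looking for a way to return a value for every percentile I care about: the exact match if there is one, otherwise the value closest to that percentile. Is this possible?

1 Answer:

Answer 0: (score: 2)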

You can rank the records and then take the largest value that falls below each percentile cutoff:

WITH time_values AS (
    SELECT
        var,
        (end_time - start_time) * 1.0 / 3600000000 AS num_hours,
        row_number() OVER (PARTITION BY var1 ORDER BY num_hours) AS rank,
        count(1) OVER (PARTITION BY var1) AS records
    FROM table
    WHERE
        start_time >= 1493596800000000
        AND end_time < 1493683200000000
)
SELECT
    var,
    -- records is the per-partition row count computed in the CTE above
    max(case when 1.0 * rank / records < 0.25 then num_hours end) as percentile_25,
    max(case when 1.0 * rank / records < 0.50 then num_hours end) as percentile_50,
    max(case when 1.0 * rank / records < 0.80 then num_hours end) as percentile_80,
    max(case when 1.0 * rank / records < 0.99 then num_hours end) as percentile_99
FROM time_values
GROUP BY var
ORDER BY 1;

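Or do the same thing with the PERCENT_RANK() output. If you really want the result in the question's column layout (var, pct_rank, num_hours), just UNION the results of the last step to get the desired structure.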
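As a rough sketch of the PERCENT_RANK() variant: PERCENT_RANK() is (rank - 1) / (rows in partition - 1), so it only lands exactly on 0.25, 0.5 and so on for certain partition sizes. One way around that is to pair every row with every target percentile and keep the row whose pct_rank is nearest. The query below is untested and makes some assumptions: var1 in the queries above is taken to be the same grouping column as var, "table" is kept as the question's placeholder name, and the durations, targets, ranked, target_pct and closeness names are made up for illustration.

```sql
-- Sketch only: assumes var1 above is the same grouping column as var.
WITH durations AS (
    -- compute num_hours first so later steps can refer to it by name
    SELECT
        var,
        (end_time - start_time) * 1.0 / 3600000000 AS num_hours
    FROM table
    WHERE
        start_time >= 1493596800000000
        AND end_time < 1493683200000000
),
time_values AS (
    SELECT
        var,
        num_hours,
        PERCENT_RANK() OVER (PARTITION BY var ORDER BY num_hours) AS pct_rank
    FROM durations
),
targets AS (
    -- hypothetical helper: one row per percentile of interest
    SELECT 0.25 AS target_pct UNION ALL
    SELECT 0.50 UNION ALL
    SELECT 0.80 UNION ALL
    SELECT 0.99
),
ranked AS (
    SELECT
        t.var,
        p.target_pct,
        t.pct_rank,
        t.num_hours,
        -- closeness = 1 marks the row nearest to each target;
        -- ties are broken in favour of the lower pct_rank
        ROW_NUMBER() OVER (
            PARTITION BY t.var, p.target_pct
            ORDER BY ABS(t.pct_rank - p.target_pct), t.pct_rank
        ) AS closeness
    FROM time_values t
    CROSS JOIN targets p
)
SELECT var, target_pct, pct_rank, num_hours
FROM ranked
WHERE closeness = 1
ORDER BY 1, 2;
```

This returns one row per var and target percentile, which is already the row-per-percentile layout from the question, so no UNION is needed. The CROSS JOIN multiplies the row count by the number of targets, which is likely fine for four percentiles but worth keeping in mind on large tables.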