Question

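I have a transactions table and want to add a percentile column that gives each transaction's percentile within its month, based on the amount column.

Here is an example using quartiles instead of percentiles:

Sample input: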

id | month | amount
1  |   1   |   1
2  |   1   |   2
3  |   1   |   5
4  |   1   |   3
5  |   2   |   1
6  |   2   |   3
1  |   2   |   5
1  |   2   |   7
1  |   2   |   9
1  |   2   |   11
1  |   2   |   15
1  |   2   |   16

Sample output:

id | month | amount |  quartile
1  |   1   |   1    |      25
2  |   1   |   2    |      50
3  |   1   |   5    |      100
4  |   1   |   3    |      75
5  |   2   |   1    |      25
6  |   2   |   3    |      25
1  |   2   |   5    |      50
1  |   2   |   15   |      100
1  |   2   |   9    |      75
1  |   2   |   11   |      75
1  |   2   |   7    |      50
1  |   2   |   16   |      100

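Currently, I use Postgres's percentile_cont function to find the amount values at the cutoff points of the different percentiles, then loop over them and update the percentile column accordingly. Unfortunately, this approach is far too slow because I have many different months. Any ideas on how to do this faster, preferably by combining the percentile calculation and the update in a single SQL statement?

My code: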

# Assumes `connection` and `cursor` already exist
# (presumably a psycopg2 connection and one of its cursors).
num_buckets = 10

for i in range(num_buckets):
    decimal_percentile = (i+1)*(1.0/num_buckets)
    prev_decimal_percentile = i*1.0/num_buckets
    percentile = int(decimal_percentile*100)
    # Per month, fetch the amounts at the lower and upper cutoff of this bucket.
    cursor.execute("""SELECT month,
                             percentile_cont(%s) WITHIN GROUP (ORDER BY amount ASC),
                             percentile_cont(%s) WITHIN GROUP (ORDER BY amount ASC)
                      FROM transactions GROUP BY month;""",
                   (prev_decimal_percentile, decimal_percentile))
    iter_cursor = connection.cursor()
    # One UPDATE per month and bucket; these round trips are the slow part.
    for data in cursor:
        iter_cursor.execute("""UPDATE transactions SET percentile=%s
                               WHERE month = %s
                                     AND amount >= %s AND amount <= %s;""",
                            (percentile, data[0], data[1], data[2]))

Answer 1

You can do this in a single query, for example with 4 buckets:

update transactions t
set percentile = calc_percentile
from (
    select distinct on (month, amount) 
        id, 
        month, 
        amount, 
        calc_percentile
    from transactions
    join (
        select 
            bucket,
            month as calc_month, 
            percentile_cont(bucket*1.0/4) within group (order by amount asc) as calc_amount,
            bucket*100/4 as calc_percentile
        from transactions 
        cross join generate_series(1, 4) bucket
        group by month, bucket
        ) s on month = calc_month and amount <= calc_amount
    order by month, amount, calc_percentile 
    ) s
where t.month = s.month and t.amount = s.amount;

Result:

select *
from transactions
order by month, amount;

 id | month | amount | percentile 
----+-------+--------+------------
  1 |     1 |      1 |         25
  2 |     1 |      2 |         50
  4 |     1 |      3 |         75
  3 |     1 |      5 |        100
  5 |     2 |      1 |         25
  6 |     2 |      3 |         25
  1 |     2 |      5 |         50
  1 |     2 |      7 |         50
  1 |     2 |      9 |         75
  1 |     2 |     11 |         75
  1 |     2 |     15 |        100
  1 |     2 |     16 |        100
(12 rows)
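
To see where these numbers come from, the logic of the query can be replayed in plain Python. This is only a sketch under assumptions: the function names percentile_cont and assign_buckets are made up for illustration, and the interpolation follows my reading of how Postgres describes percentile_cont (linear interpolation between adjacent sorted values), not its actual source. Each amount gets the smallest bucket whose cutoff is greater than or equal to it, which is what the distinct on ... order by calc_percentile part of the query selects.

```python
from collections import defaultdict


def percentile_cont(sorted_values, fraction):
    """Linearly interpolated percentile over an already sorted list.

    Assumption: this mirrors the documented behaviour of Postgres'
    percentile_cont; it is an illustration, not the real implementation.
    """
    position = fraction * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = position - lower
    return sorted_values[lower] + weight * (sorted_values[upper] - sorted_values[lower])


def assign_buckets(rows, num_buckets=4):
    """Return (month, amount, percentile) for each (month, amount) row."""
    amounts_by_month = defaultdict(list)
    for month, amount in rows:
        amounts_by_month[month].append(amount)

    # Compute the bucket cutoffs once per month.
    cutoffs_by_month = {}
    for month, amounts in amounts_by_month.items():
        amounts.sort()
        cutoffs_by_month[month] = [
            (percentile_cont(amounts, bucket / num_buckets), bucket * 100 // num_buckets)
            for bucket in range(1, num_buckets + 1)
        ]

    result = []
    for month, amount in rows:
        # Pick the lowest bucket whose cutoff is not below the amount.
        for cutoff, percentile in cutoffs_by_month[month]:
            if amount <= cutoff:
                result.append((month, amount, percentile))
                break
    return result


# The (month, amount) pairs from the sample input in the question.
sample_rows = [
    (1, 1), (1, 2), (1, 5), (1, 3),
    (2, 1), (2, 3), (2, 5), (2, 7), (2, 9), (2, 11), (2, 15), (2, 16),
]

for month, amount, percentile in assign_buckets(sample_rows):
    print(month, amount, percentile)
```

For month 1 the quartile cutoffs come out as 1.75, 2.5, 3.5 and 5, so the amounts 1, 2, 3 and 5 land in 25, 50, 75 and 100 respectively, matching the table above.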

By the way, id should be a primary key; it could then be used in the join for better performance.
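
A sketch of what that could look like from Python, with the bucket count passed as a parameter so it also covers the 10 buckets from the question. Treat it as an assumption-laden, untested variant rather than a drop-in solution: it assumes id is unique, that the driver is psycopg2 (for the %(name)s placeholder style), and update_percentiles is a hypothetical helper name.

```python
# Variant of the single-statement update that joins on id instead of
# (month, amount). Assumption: id is a unique key of transactions.
UPDATE_SQL = """
update transactions t
set percentile = s.calc_percentile
from (
    select distinct on (tr.id)
        tr.id,
        c.calc_percentile
    from transactions tr
    join (
        select
            month as calc_month,
            percentile_cont(bucket * 1.0 / %(buckets)s)
                within group (order by amount asc) as calc_amount,
            bucket * 100 / %(buckets)s as calc_percentile
        from transactions
        cross join generate_series(1, %(buckets)s) bucket
        group by month, bucket
    ) c on tr.month = c.calc_month and tr.amount <= c.calc_amount
    order by tr.id, c.calc_percentile
) s
where t.id = s.id;
"""


def update_percentiles(connection, num_buckets=10):
    """Run the update once and return the number of rows it touched.

    `connection` is assumed to be an open psycopg2 connection.
    """
    if not isinstance(num_buckets, int) or num_buckets < 1:
        raise ValueError("num_buckets must be a positive integer")
    cursor = connection.cursor()
    # The driver substitutes the bucket count for each %(buckets)s placeholder.
    cursor.execute(UPDATE_SQL, {"buckets": num_buckets})
    connection.commit()
    return cursor.rowcount
```

Calling update_percentiles(connection, 4) would correspond to the quartile example; the whole job is then one round trip to the server instead of one UPDATE per month and bucket.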

DbFiddle.
