How to optimize a PostgreSQL leaderboard window-function query

Asked: 2017-07-30 16:52:19

Tags: sql postgresql ranking window-functions leaderboard

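In our API we have a basic ranking/leaderboard feature. Each client user has a list of "actions" they can perform, and each action earns a score. All actions are recorded in an "actions" table, and each user can then request the leaderboard for the current month (the leaderboard resets every month). Nothing fancy.

We have two tables: one containing the users and one containing the actions (I removed the irrelevant columns):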

> \d client_users
                                           Table "public.client_users"
         Column         |            Type             |                         Modifiers
------------------------+-----------------------------+-----------------------------------------------------------
 id                     | integer                     | not null default nextval('client_users_id_seq'::regclass)
 app_id                 | integer                     |
 user_id                | character varying           | not null
 created_at             | timestamp without time zone |
 updated_at             | timestamp without time zone |
Indexes:
    "client_users_pkey" PRIMARY KEY, btree (id)
    "index_client_users_on_app_id" btree (app_id)
    "index_client_users_on_user_id" btree (user_id)
Foreign-key constraints:
    "client_users_app_id_fk" FOREIGN KEY (app_id) REFERENCES apps(id)
Referenced by:
    TABLE "leaderboard_actions" CONSTRAINT "leaderboard_actions_client_user_id_fk" FOREIGN KEY (client_user_id) REFERENCES client_users(id)

> \d leaderboard_actions
                                       Table "public.leaderboard_actions"
     Column     |            Type             |                            Modifiers
----------------+-----------------------------+------------------------------------------------------------------
 id             | integer                     | not null default nextval('leaderboard_actions_id_seq'::regclass)
 client_user_id | integer                     |
 score          | integer                     | not null default 0
 created_at     | timestamp without time zone |
 updated_at     | timestamp without time zone |
Indexes:
    "leaderboard_actions_pkey" PRIMARY KEY, btree (id)
    "index_leaderboard_actions_on_client_user_id" btree (client_user_id)
    "index_leaderboard_actions_on_created_at" btree (created_at)
Foreign-key constraints:
    "leaderboard_actions_client_user_id_fk" FOREIGN KEY (client_user_id) REFERENCES client_users(id)

The query I want to optimize is the following:

SELECT
  cu.user_id,
  SUM(la.score) AS total_score,
  rank() OVER (ORDER BY SUM(la.score) DESC) AS ranking
FROM client_users cu
JOIN leaderboard_actions la ON cu.id = la.client_user_id
WHERE cu.app_id = 8
AND la.created_at BETWEEN '2017-07-01 00:00:00.000000' AND '2017-07-31 23:59:59.999999'
GROUP BY cu.id
ORDER BY total_score DESC
LIMIT 20;

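Note: client_users.user_id is a varchar "human id"; the tables are joined through the foreign key on client_users.id (the naming is not great, I know :D)

Basically I'm asking PostgreSQL to give me the top 20 users, ranked by the sum of the scores of their individual actions in the current month.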
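To make the ranking semantics concrete, here is a minimal sketch on toy data. The inline `actions` rows are made up for illustration and are not part of the real schema; the point is only how rank() behaves when it is ordered by an aggregate:

```sql
-- Hypothetical toy data: users 1 and 2 both total 15 points, user 3 totals 7.
WITH actions(client_user_id, score) AS (
  VALUES (1, 10), (1, 5), (2, 15), (3, 7)
)
SELECT
  client_user_id,
  SUM(score) AS total_score,
  -- The window function is evaluated after GROUP BY, so it can order by the aggregate.
  rank() OVER (ORDER BY SUM(score) DESC) AS ranking
FROM actions
GROUP BY client_user_id
ORDER BY total_score DESC;
-- Users 1 and 2 tie at ranking 1; user 3 gets ranking 3, because rank() leaves gaps after ties.
```

Because the ranking is computed over every aggregated user before LIMIT is applied, the full per-user aggregation has to be done even when only 20 rows are returned.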

As the query plan shows, it is not that fast:

Limit  (cost=8641.96..8642.05 rows=20 width=52) (actual time=135.544..135.560 rows=20 loops=1)
 Output: cu.user_id, (sum(la.score)), (rank() OVER (?)), cu.id
 ->  WindowAgg  (cost=8641.96..8841.42 rows=44326 width=52) (actual time=135.543..135.559 rows=20 loops=1)
       Output: cu.user_id, (sum(la.score)), rank() OVER (?), cu.id
       ->  Sort  (cost=8641.96..8664.12 rows=44326 width=44) (actual time=135.538..135.539 rows=20 loops=1)
             Output: (sum(la.score)), cu.id, cu.user_id
             Sort Key: (sum(la.score)) DESC
             Sort Method: quicksort  Memory: 1451kB
             ->  HashAggregate  (cost=7824.77..7957.75 rows=44326 width=44) (actual time=130.938..133.124 rows=10411 loops=1)
                   Output: sum(la.score), cu.id, cu.user_id
                   Group Key: cu.id
                   ->  Hash Join  (cost=5858.66..7780.44 rows=44326 width=40) (actual time=50.849..111.346 rows=79382 loops=1)
                         Output: cu.id, cu.user_id, la.score
                         Hash Cond: (la.client_user_id = cu.id)
                         ->  Index Scan using index_leaderboard_actions_on_created_at on public.leaderboard_actions la  (cost=0.09..1736.77 rows=69494 width=8) (actual time=0.020..33.773 rows=79382 loops=1)
                               Output: la.id, la.client_user_id, la.rule_id, la.score, la.created_at, la.updated_at, la.success
                               Index Cond: ((la.created_at >= '2017-07-01 00:00:00'::timestamp without time zone) AND (la.created_at <= '2017-07-31 23:59:59.999999'::timestamp without time zone))
                         ->  Hash  (cost=5572.11..5572.11 rows=81846 width=36) (actual time=50.330..50.330 rows=81859 loops=1)
                               Output: cu.user_id, cu.id
                               Buckets: 131072  Batches: 1  Memory Usage: 6583kB
                               ->  Seq Scan on public.client_users cu  (cost=0.00..5572.11 rows=81846 width=36) (actual time=0.014..34.539 rows=81859 loops=1)
                                     Output: cu.user_id, cu.id
                                     Filter: (cu.app_id = 8)
                                     Rows Removed by Filter: 46610
Planning time: 1.276 ms
Execution time: 136.176 ms
(26 rows)

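To give you an idea of the sizes:

  • client_users has about 128471 rows, of which the query touches only 81860 (app_id = 8)
  • leaderboard_actions has 1609992 rows, 79435 of them in the current month

Any ideas?

Thanks!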

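1 answer:

Answer 0: (score: 1)

The plan you are getting is actually more than reasonably fast.

You can help the planner with a couple of additional indexes: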

CREATE INDEX idx_client_users_app_id_user  
    ON client_users(app_id, id, user_id) ;

CREATE INDEX idx_leaderboard_actions_3 
    ON leaderboard_actions(created_at, client_user_id, score) ;

After creating the two indexes, run:

VACUUM ANALYZE client_users;
VACUUM ANALYZE leaderboard_actions;

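These indexes will (most likely) allow the query to be executed by reading only the indexes (and not the tables client_users and leaderboard_actions), since all the information the query needs is already in them. The plan should show some Index Only Scan nodes.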
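One way to check whether the new indexes are actually picked up is to re-run the query under EXPLAIN. This is a sketch that assumes the two indexes above exist and both tables have been vacuumed; the plan you get depends on your data and configuration, so treat it as a diagnostic step rather than a guaranteed result:

```sql
-- Re-run the original query and inspect the resulting plan.
-- Look for "Index Only Scan using idx_client_users_app_id_user" and
-- "Index Only Scan using idx_leaderboard_actions_3" nodes.
EXPLAIN (ANALYZE, BUFFERS)
SELECT
  cu.user_id,
  SUM(la.score) AS total_score,
  rank() OVER (ORDER BY SUM(la.score) DESC) AS ranking
FROM client_users cu
JOIN leaderboard_actions la ON cu.id = la.client_user_id
WHERE cu.app_id = 8
AND la.created_at BETWEEN '2017-07-01 00:00:00.000000' AND '2017-07-31 23:59:59.999999'
GROUP BY cu.id
ORDER BY total_score DESC
LIMIT 20;
```

If the Index Only Scan nodes report a high "Heap Fetches" count, the visibility map is probably out of date and the scan still has to visit the table; running VACUUM again on that table usually brings the count down.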

You can find a simulation of your scenario on dbfiddle here; in it, execution time dropped by 30%. Your actual scenario may see a similar improvement.