How can I optimize a SQL query that uses DISTINCT ON and a JOIN against many values?

Asked: 2019-07-01 10:27:59

Tags: sql postgresql query-optimization

I have a query like the following, which joins against ~6000 values:

SELECT DISTINCT ON (user_id)
       user_id,
       finished_at AS last_deposit_date,
       CASE WHEN currency = 'RUB' THEN amount_cents END AS last_deposit_amount_cents
FROM payments
JOIN (VALUES (5),(22),(26)) --~6000 values
    AS v(user_id) USING (user_id)
WHERE action = 'deposit'
  AND success = 't'
  AND currency IN ('RUB')
ORDER BY user_id, finished_at DESC

Query plan for the query with many values:

Unique  (cost=444606.97..449760.44 rows=19276 width=24) (actual time=6129.403..6418.317 rows=5991 loops=1)
  Buffers: shared hit=2386527, temp read=7807 written=7808
  ->  Sort  (cost=444606.97..447183.71 rows=1030695 width=24) (actual time=6129.401..6295.457 rows=1877039 loops=1)
        Sort Key: payments.user_id, payments.finished_at DESC
        Sort Method: external merge  Disk: 62456kB
        Buffers: shared hit=2386527, temp read=7807 written=7808
        ->  Nested Loop  (cost=0.43..341665.35 rows=1030695 width=24) (actual time=0.612..5085.376 rows=1877039 loops=1)
              Buffers: shared hit=2386521
              ->  Values Scan on "*VALUES*"  (cost=0.00..75.00 rows=6000 width=4) (actual time=0.002..4.507 rows=6000 loops=1)
              ->  Index Scan using index_payments_on_user_id on payments  (cost=0.43..54.78 rows=172 width=28) (actual time=0.010..0.793 rows=313 loops=6000)
                    Index Cond: (user_id = "*VALUES*".column1)
                    Filter: (success AND ((action)::text = 'deposit'::text) AND ((currency)::text = 'RUB'::text))
                    Rows Removed by Filter: 85
                    Buffers: shared hit=2386521
Planning time: 5.886 ms
Execution time: 6429.685 ms

I am using PostgreSQL 10.8.0. Is there any way to speed up this query?

I tried replacing DISTINCT with a recursive CTE:

WITH RECURSIVE t AS (
    (SELECT min(user_id) AS user_id FROM payments)
    UNION ALL
    SELECT (SELECT min(user_id)
            FROM payments
            WHERE user_id > t.user_id) AS user_id
    FROM t
    WHERE t.user_id IS NOT NULL
)
SELECT payments.*
FROM t
JOIN (VALUES (5),(22),(26)) --~6000 values
    AS v(user_id) USING (user_id),
LATERAL (
    SELECT user_id,
           finished_at AS last_deposit_date,
           CASE WHEN currency = 'RUB' THEN amount_cents END AS last_deposit_amount_cents
    FROM payments
    WHERE payments.user_id = t.user_id
      AND action = 'deposit'
      AND success = 't'
      AND currency IN ('RUB')
    ORDER BY finished_at DESC
    LIMIT 1
) AS payments
WHERE t.user_id IS NOT NULL;

But it turned out to be even slower:

  

Hash Join  (cost=418.67..21807.22 rows=3000 width=24) (actual time=16.804..10843.174 rows=5991 loops=1)
  Hash Cond: (t.user_id = "*VALUES*".column1)
  Buffers: shared hit=6396763
  CTE t
    ->  Recursive Union  (cost=0.46..53.73 rows=101 width=8) (actual time=0.142..1942.351 rows=237029 loops=1)
          Buffers: shared hit=864281
          ->  Result  (cost=0.46..0.47 rows=1 width=8) (actual time=0.141..0.142 rows=1 loops=1)
                Buffers: shared hit=4
                InitPlan 3 (returns $1)
                  ->  Limit  (cost=0.43..0.46 rows=1 width=8) (actual time=0.138..0.139 rows=1 loops=1)
                        Buffers: shared hit=4
                        ->  Index Only Scan using index_payments_on_user_id on payments payments_2  (cost=0.43..155102.74 rows=4858092 width=8) (actual time=0.137..0.138 rows=1 loops=1)
                              Index Cond: (user_id IS NOT NULL)
                              Heap Fetches: 0
                              Buffers: shared hit=4
          ->  WorkTable Scan on t t_1  (cost=0.00..5.12 rows=10 width=8) (actual time=0.008..0.008 rows=1 loops=237029)
                Filter: (user_id IS NOT NULL)
                Rows Removed by Filter: 0
                Buffers: shared hit=864277
                SubPlan 2
                  ->  Result  (cost=0.48..0.49 rows=1 width=8) (actual time=0.007..0.007 rows=1 loops=237028)
                        Buffers: shared hit=864277
                        InitPlan 1 (returns $3)
                          ->  Limit  (cost=0.43..0.48 rows=1 width=8) (actual time=0.007..0.007 rows=1 loops=237028)
                                Buffers: shared hit=864277
                                ->  Index Only Scan using index_payments_on_user_id on payments payments_1  (cost=0.43..80786.25 rows=1619364 width=8) (actual time=0.007..0.007 rows=1 loops=237028)
                                      Index Cond: ((user_id IS NOT NULL) AND (user_id > t_1.user_id))
                                      Heap Fetches: 46749
                                      Buffers: shared hit=864277
  ->  Nested Loop  (cost=214.94..21498.23 rows=100 width=32) (actual time=0.475..10794.535 rows=167333 loops=1)
        Buffers: shared hit=6396757
        ->  CTE Scan on t  (cost=0.00..2.02 rows=100 width=8) (actual time=0.145..1998.788 rows=237028 loops=1)
              Filter: (user_id IS NOT NULL)
              Rows Removed by Filter: 1
              Buffers: shared hit=864281
        ->  Limit  (cost=214.94..214.94 rows=1 width=24) (actual time=0.037..0.037 rows=1 loops=237028)
              Buffers: shared hit=5532476
              ->  Sort  (cost=214.94..215.37 rows=172 width=24) (actual time=0.036..0.036 rows=1 loops=237028)
                    Sort Key: payments.finished_at DESC
                    Sort Method: quicksort  Memory: 25kB
                    Buffers: shared hit=5532476
                    ->  Index Scan using index_payments_on_user_id on payments  (cost=0.43..214.08 rows=172 width=24) (actual time=0.003..0.034 rows=15 loops=237028)
                          Index Cond: (user_id = t.user_id)
                          Filter: (success AND ((action)::text = 'deposit'::text) AND ((currency)::text = 'RUB'::text))
                          Rows Removed by Filter: 6
                          Buffers: shared hit=5532473
  ->  Hash  (cost=75.00..75.00 rows=6000 width=4) (actual time=2.255..2.255 rows=6000 loops=1)
        Buckets: 8192  Batches: 1  Memory Usage: 275kB
        ->  Values Scan on "*VALUES*"  (cost=0.00..75.00 rows=6000 width=4) (actual time=0.004..1.206 rows=6000 loops=1)
Planning time: 7.029 ms
Execution time: 10846.774 ms

1 Answer:

Answer 0 (score: 1)

For this query:

SELECT DISTINCT ON (user_id)
       p.user_id,
       p.finished_at as last_deposit_date,
       (CASE WHEN p.currency = 'RUB' THEN p.amount_cents  END) as last_deposit_amount_cents
FROM payments p JOIN
     (VALUES (5), (22), (26) --~6000 values
     ) v(user_id)
     USING (user_id)
WHERE p.action = 'deposit' AND
      p.success = 't' AND
      p.currency = 'RUB'
ORDER BY p.user_id, p.finished_at DESC;

I don't fully understand the CASE expression, because the WHERE clause is already filtering out every currency other than 'RUB'.
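As a rough sketch of that point: since the WHERE clause only lets 'RUB' rows through, the CASE can never reach its implicit NULL branch, so selecting amount_cents directly should return the same rows. This assumes the table and column names from the question and keeps the three sample ids in place of the full ~6000-value list.

```sql
-- Same query with the redundant CASE removed.
-- Every row that survives the WHERE clause has currency = 'RUB',
-- so amount_cents can be selected as-is.
SELECT DISTINCT ON (user_id)
       p.user_id,
       p.finished_at  AS last_deposit_date,
       p.amount_cents AS last_deposit_amount_cents
FROM payments p
JOIN (VALUES (5), (22), (26)  -- placeholder for the ~6000 values
     ) v(user_id)
     USING (user_id)
WHERE p.action = 'deposit'
  AND p.success = 't'
  AND p.currency = 'RUB'
ORDER BY p.user_id, p.finished_at DESC;
```

This is a readability change only; on its own it is not expected to alter the plan.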

That said, I would expect an index on (action, success, currency, user_id, finished_at desc) to help.
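A minimal sketch of creating that index, assuming the payments table from the question. The index name index_payments_on_deposit_lookup is made up for illustration, and whether the planner actually picks the index up should be checked by re-running EXPLAIN (ANALYZE, BUFFERS) afterwards.

```sql
-- Hypothetical index name; the column order follows the suggestion above:
-- the three equality-filtered columns first, then user_id, then finished_at
-- descending so the newest deposit per user comes first within each user_id.
CREATE INDEX index_payments_on_deposit_lookup
    ON payments (action, success, currency, user_id, finished_at DESC);

-- Refresh planner statistics after building the index.
ANALYZE payments;
```

The idea is that the equality predicates and the per-user ordering can then be answered from the index, rather than filtering and sorting the roughly 1.9 million rows seen in the first plan.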