Question

查询仅涉及一个具有定义的表act：

┌────────────────┬──────────────┬───────────┐
│     Column     │     Type     │ Modifiers │
├────────────────┼──────────────┼───────────┤
│ loan_id        │ bigint       │ not null  │
│ dt             │ date         │ not null  │
│ dlq_stat       │ character(3) │           │
└────────────────┴──────────────┴───────────┘
Indexes:
    "act_pkey" PRIMARY KEY, btree (loan_id, dt)

act包含约20.7M条记录和611K唯一loan_id条。预期查询的想法是找到每个dt的最后loan_id，然后一起收集这些记录。

查询1

with maxdt as
(
  select loan_id, max(dt) as maxdt
    from act
   group by loan_id
)
select maxdt.loan_id,
       maxdt.maxdt as lastdt,
       act.dlq_stat
  from maxdt
  join act
    on maxdt.loan_id = act.loan_id
   and maxdt.maxdt = act.dt;

使用查询计划：

┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                                 QUERY PLAN                                                              │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Nested Loop  (cost=527126.77..1105840.68 rows=103725 width=21) (actual time=28608.305..5754855.397 rows=611347 loops=1)                 │
│   CTE maxdt                                                                                                                             │
│     ->  HashAggregate  (cost=526395.38..527126.21 rows=73083 width=12) (actual time=28599.889..29572.679 rows=611347 loops=1)           │
│           Group Key: act.loan_id                                                                                                        │
│           ->  Seq Scan on act  (cost=0.00..422669.92 rows=20745092 width=12) (actual time=10.527..24174.317 rows=20744968 loops=1)      │
│   ->  CTE Scan on maxdt  (cost=0.00..1461.66 rows=73083 width=12) (actual time=28599.893..30635.047 rows=611347 loops=1)                │
│   ->  Index Scan using act_pkey on act act  (cost=0.56..7.87 rows=2 width=21) (actual time=9.347..9.358 rows=1 loops=611347)            │
│         Index Cond: ((loan_id = maxdt.loan_id) AND (dt = maxdt.maxdt))                                                                  │
│ Planning time: 0.275 ms                                                                                                                 │
│ Execution time: 5755033.052 ms                                                                                                          │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

查询2

with maxdt as
(
  select loan_id, max(dt) as maxdt
    from act
   group by loan_id
   order by loan_id   /* order is forced to take advantage of locality of the index */
)
select maxdt.loan_id,
       maxdt.maxdt as lastdt,
       act.dlq_stat
  from maxdt
  join act
    on maxdt.loan_id = act.loan_id
   and maxdt.maxdt = act.dt;

使用查询计划：

┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                                   QUERY PLAN                                                            │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Nested Loop  (cost=533213.58..1111927.48 rows=103725 width=21) (actual time=19036.526..57610.487 rows=611347 loops=1)                   │
│   CTE maxdt                                                                                                                             │
│     ->  Sort  (cost=533030.31..533213.02 rows=73083 width=12) (actual time=19036.484..19101.752 rows=611347 loops=1)                    │
│           Sort Key: act.loan_id                                                                                                         │
│           Sort Method: quicksort  Memory: 53233kB                                                                                       │
│           ->  HashAggregate  (cost=526395.38..527126.21 rows=73083 width=12) (actual time=18699.754..18857.285 rows=611347 loops=1)     │
│                 Group Key: act.loan_id                                                                                                  │
│                 ->  Seq Scan on act  (cost=0.00..422669.92 rows=20745092 width=12) (actual time=0.005..14234.374 rows=20744968 loops=1) │
│   ->  CTE Scan on maxdt  (cost=0.00..1461.66 rows=73083 width=12) (actual time=19036.489..19351.742 rows=611347 loops=1)                │
│   ->  Index Scan using act_pkey on act act  (cost=0.56..7.87 rows=2 width=21) (actual time=0.061..0.061 rows=1 loops=611347)            │
│         Index Cond: ((loan_id = maxdt.loan_id) AND (dt = maxdt.maxdt))                                                                  │
│ Planning time: 0.217 ms                                                                                                                 │
│ Execution time: 57662.761 ms                                                                                                            │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

查询3

根据@Laurenz的建议，Query 3使用Hash Anti Join来完成工作。

explain analyze
select loan_id, dt, dlq_stat
  from act
 where not exists (
         select NULL
           from act as a
          where a.loan_id = act.loan_id
            and a.dt > act.dt
       );

使用查询计划：

┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                                QUERY PLAN                                                        │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Hash Anti Join  (cost=783278.57..2036132.40 rows=13830061 width=21) (actual time=88983.113..295036.106 rows=611347 loops=1)      │
│   Hash Cond: (act.loan_id = a.loan_id)                                                                                           │
│   Join Filter: (a.dt > act.dt)                                                                                                   │
│   Rows Removed by Join Filter: 20744968                                                                                          │
│   ->  Seq Scan on act  (cost=0.00..422669.92 rows=20745092 width=21) (actual time=0.004..84807.803 rows=20744968 loops=1)        │
│   ->  Hash  (cost=422669.92..422669.92 rows=20745092 width=12) (actual time=88937.978..88937.978 rows=20744968 loops=1)          │
│         Buckets: 4194304  Batches: 16  Memory Usage: 89089kB                                                                     │
│         ->  Seq Scan on act a  (cost=0.00..422669.92 rows=20745092 width=12) (actual time=0.007..85665.452 rows=20744968 loops=1)│
│ Planning time: 129.782 ms                                                                                                        │
│ Execution time: 295058.407 ms                                                                                                    │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

观察和问题

为两个查询选择带索引扫描的嵌套循环。然而，第一个查询非常慢：如果数据很冷，它可能会运行超过1小时。虽然第二个查询通过loan_id上的强制顺序利用索引位置来提高效率。它通常可在1分钟内完成。

根据我的观察，主要问题是第一个查询大大低估了索引扫描的成本......这是许多postgresql性能问题的主要原因。如果未订购驱动器表maxdt，则索引扫描的成本不会高于seq_page_cost，而是random_page_cost。

查询3也是一种合理的方法，但其性能略差于查询2.我不知道如何将 Hash Anti Join 与 Hash Join 进行比较，鉴于连接条件也不同。

最后，这里有两个问题：

我们是否可以在postgresql查询构建器中构建这样的逻辑？了解它是一个很长的镜头。
是否有一种很好的方法可以强制上述查询使用Hash Join？

谢谢！

Answer 1

当您说索引扫描的成本被低估时，您指的是cost=0.56..7.87吗？这是单个索引扫描的成本，总估计成本是该值乘以估计的循环计数73083 后一个值实际上低了8.4倍，可能可以通过更高的default_statistics_target设置进行改善。

我建议您像这样编写查询：

SELECT loan_id, dt, dlq_stat
FROM act
WHERE NOT EXISTS (
   SELECT NULL
   FROM act a
   WHERE a.loan_id = act.loan_id AND a.dt > act.dt
);

这应该会导致像

这样的查询计划

Hash Anti Join
  Hash Cond: (act.loan_id = a.loan_id)
  Join Filter: (a.dt > act.dt)
  ->  Seq Scan on act
  ->  Hash
        ->  Seq Scan on act a

Answer 2

您可以尝试Postgres特定的DISTINCT ON clause.

SELECT DISTINCT ON (loan_id) loan_id, dt AS lastdt, dlq_stat
FROM act
ORDER BY loan_id, dt DESC;

我不知道PostgreSQL是否足够聪明，可以使用该查询的索引。将索引重定为(loan_id, dt DESC)而不是普通(loan_id, dt)可能有所帮助。

Postgresql：错误的索引扫描成本估算导致错误的计划

查询1

查询2

查询3

观察和问题

2 个答案: