The query involves a single table, act, defined as follows:
┌────────────────┬──────────────┬───────────┐
│ Column │ Type │ Modifiers │
├────────────────┼──────────────┼───────────┤
│ loan_id │ bigint │ not null │
│ dt │ date │ not null │
│ dlq_stat │ character(3) │ │
└────────────────┴──────────────┴───────────┘
Indexes:
"act_pkey" PRIMARY KEY, btree (loan_id, dt)
act contains about 20.7M records and 611K unique loan_id values. The intended query finds the last dt for each loan_id and then collects those records together.
with maxdt as
(
select loan_id, max(dt) as maxdt
from act
group by loan_id
)
select maxdt.loan_id,
maxdt.maxdt as lastdt,
act.dlq_stat
from maxdt
join act
on maxdt.loan_id = act.loan_id
and maxdt.maxdt = act.dt;
Query 1 runs with this query plan:
┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Nested Loop (cost=527126.77..1105840.68 rows=103725 width=21) (actual time=28608.305..5754855.397 rows=611347 loops=1) │
│ CTE maxdt │
│ -> HashAggregate (cost=526395.38..527126.21 rows=73083 width=12) (actual time=28599.889..29572.679 rows=611347 loops=1) │
│ Group Key: act.loan_id │
│ -> Seq Scan on act (cost=0.00..422669.92 rows=20745092 width=12) (actual time=10.527..24174.317 rows=20744968 loops=1) │
│ -> CTE Scan on maxdt (cost=0.00..1461.66 rows=73083 width=12) (actual time=28599.893..30635.047 rows=611347 loops=1) │
│ -> Index Scan using act_pkey on act act (cost=0.56..7.87 rows=2 width=21) (actual time=9.347..9.358 rows=1 loops=611347) │
│ Index Cond: ((loan_id = maxdt.loan_id) AND (dt = maxdt.maxdt)) │
│ Planning time: 0.275 ms │
│ Execution time: 5755033.052 ms │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
with maxdt as
(
select loan_id, max(dt) as maxdt
from act
group by loan_id
order by loan_id /* order is forced to take advantage of locality of the index */
)
select maxdt.loan_id,
maxdt.maxdt as lastdt,
act.dlq_stat
from maxdt
join act
on maxdt.loan_id = act.loan_id
and maxdt.maxdt = act.dt;
Query 2, with the forced order on loan_id, runs with this query plan:
┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Nested Loop (cost=533213.58..1111927.48 rows=103725 width=21) (actual time=19036.526..57610.487 rows=611347 loops=1) │
│ CTE maxdt │
│ -> Sort (cost=533030.31..533213.02 rows=73083 width=12) (actual time=19036.484..19101.752 rows=611347 loops=1) │
│ Sort Key: act.loan_id │
│ Sort Method: quicksort Memory: 53233kB │
│ -> HashAggregate (cost=526395.38..527126.21 rows=73083 width=12) (actual time=18699.754..18857.285 rows=611347 loops=1) │
│ Group Key: act.loan_id │
│ -> Seq Scan on act (cost=0.00..422669.92 rows=20745092 width=12) (actual time=0.005..14234.374 rows=20744968 loops=1) │
│ -> CTE Scan on maxdt (cost=0.00..1461.66 rows=73083 width=12) (actual time=19036.489..19351.742 rows=611347 loops=1) │
│ -> Index Scan using act_pkey on act act (cost=0.56..7.87 rows=2 width=21) (actual time=0.061..0.061 rows=1 loops=611347) │
│ Index Cond: ((loan_id = maxdt.loan_id) AND (dt = maxdt.maxdt)) │
│ Planning time: 0.217 ms │
│ Execution time: 57662.761 ms │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Following @Laurenz's suggestion, Query 3 uses a Hash Anti Join to do the job.
explain analyze
select loan_id, dt, dlq_stat
from act
where not exists (
select NULL
from act as a
where a.loan_id = act.loan_id
and a.dt > act.dt
);
with the query plan:
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Hash Anti Join (cost=783278.57..2036132.40 rows=13830061 width=21) (actual time=88983.113..295036.106 rows=611347 loops=1) │
│ Hash Cond: (act.loan_id = a.loan_id) │
│ Join Filter: (a.dt > act.dt) │
│ Rows Removed by Join Filter: 20744968 │
│ -> Seq Scan on act (cost=0.00..422669.92 rows=20745092 width=21) (actual time=0.004..84807.803 rows=20744968 loops=1) │
│ -> Hash (cost=422669.92..422669.92 rows=20745092 width=12) (actual time=88937.978..88937.978 rows=20744968 loops=1) │
│ Buckets: 4194304 Batches: 16 Memory Usage: 89089kB │
│ -> Seq Scan on act a (cost=0.00..422669.92 rows=20745092 width=12) (actual time=0.007..85665.452 rows=20744968 loops=1)│
│ Planning time: 129.782 ms │
│ Execution time: 295058.407 ms │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
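A nested loop with an index scan is chosen for both of the first two queries. However, the first query is very slow: if the data is cold, it can run for more than an hour. The second query forces an order on loan_id, which lets it exploit the locality of the index, and it usually finishes within 1 minute.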
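One way to see how much of the gap comes from disk reads rather than cached pages is to ask for buffer statistics alongside the timings. This is only a sketch, assuming a PostgreSQL version that supports the BUFFERS option of EXPLAIN; the query itself is the one from the question.

```sql
-- BUFFERS reports shared-buffer hits versus reads for every plan node,
-- so the inner index scan shows how many pages had to come from disk.
EXPLAIN (ANALYZE, BUFFERS)
WITH maxdt AS (
    SELECT loan_id, max(dt) AS maxdt
    FROM act
    GROUP BY loan_id
    ORDER BY loan_id  -- remove this line to get the unordered variant
)
SELECT maxdt.loan_id,
       maxdt.maxdt AS lastdt,
       act.dlq_stat
FROM maxdt
JOIN act
  ON maxdt.loan_id = act.loan_id
 AND maxdt.maxdt = act.dt;
```

Running it once with and once without the ORDER BY should make it possible to compare the read counts on the act_pkey index scan directly.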
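From what I can see, the main problem is that the planner greatly underestimates the cost of the index scan in the first query, which is a common cause of PostgreSQL performance problems. If the driving table maxdt is not ordered, the cost of the index scan is nowhere near seq_page_cost; it is much closer to random_page_cost.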
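To check this hypothesis, the relevant planner settings and statistics can be inspected, and the cost parameter can be changed for a single session. This is a rough sketch: the value 8 is an arbitrary illustration, not a recommendation, and whether it changes the chosen plan at all is an open question.

```sql
-- Current planner cost parameters.
SHOW seq_page_cost;
SHOW random_page_cost;

-- How closely the physical row order follows each column,
-- according to the planner's statistics.
SELECT attname, correlation
FROM pg_stats
WHERE tablename = 'act'
  AND attname IN ('loan_id', 'dt');

-- Session-only experiment (8 is an assumed, illustrative value).
SET random_page_cost = 8;
-- ... re-run EXPLAIN on the first query here and compare the estimates ...
RESET random_page_cost;
```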
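Query 3 is also a reasonable approach, but it performs worse than Query 2 (about 295 s versus about 58 s). I don't know how to compare a Hash Anti Join with a Hash Join, given that the join conditions are also different.
Finally, there are two questions here:
Hash Join
? Thanks!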
Answer 0 (score: 2)
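When you say that the cost of the index scan is underestimated, do you mean cost=0.56..7.87? That is the cost of a single index scan; the total estimated cost is that value multiplied by the estimated loop count of 73083. The latter value is actually too low by a factor of 8.4 (the actual count is 611347), which can probably be improved with a higher default_statistics_target setting.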
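A minimal sketch of that suggestion, assuming the misestimate comes from the distinct-value statistics on loan_id. It raises the target for that one column instead of changing default_statistics_target globally; the value 1000 is an illustrative choice, not something from the question. The last statement only restates the numbers from the first plan.

```sql
-- Raise the statistics target for loan_id only (1000 is an assumed value).
ALTER TABLE act ALTER COLUMN loan_id SET STATISTICS 1000;
ANALYZE act;

-- Inspect the planner's distinct-value estimate after re-analyzing.
SELECT attname, n_distinct
FROM pg_stats
WHERE tablename = 'act'
  AND attname = 'loan_id';

-- Figures from the first plan: per-loop cost times estimated loops,
-- and actual rows divided by estimated rows (roughly 8.4).
SELECT 7.87 * 73083     AS estimated_index_scan_total,
       611347 / 73083.0 AS loop_count_underestimate;
```

If the row estimate for the CTE moves closer to 611347 after ANALYZE, the total cost of the nested loop should rise accordingly.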
I suggest that you write the query like this:
SELECT loan_id, dt, dlq_stat
FROM act
WHERE NOT EXISTS (
SELECT NULL
FROM act a
WHERE a.loan_id = act.loan_id AND a.dt > act.dt
);
This should result in a query plan like this:
Hash Anti Join
  Hash Cond: (act.loan_id = a.loan_id)
  Join Filter: (a.dt > act.dt)
  -> Seq Scan on act
  -> Hash
        -> Seq Scan on act a
Answer 1 (score: 1)
You can try the Postgres-specific DISTINCT ON clause.
SELECT DISTINCT ON (loan_id) loan_id, dt AS lastdt, dlq_stat
FROM act
ORDER BY loan_id, dt DESC;
I don't know whether PostgreSQL is smart enough to use the index for that query. Redefining the index as (loan_id, dt DESC) instead of the plain (loan_id, dt) might help.
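A sketch of that index idea follows. The index name is hypothetical, and the planner may still choose a sequential scan plus sort, so the EXPLAIN output is the thing to check.

```sql
-- Hypothetical index name; the column order and direction match
-- the ORDER BY of the DISTINCT ON query.
CREATE INDEX act_loan_id_dt_desc_idx ON act (loan_id, dt DESC);

-- See whether the planner uses the new index for this query.
EXPLAIN
SELECT DISTINCT ON (loan_id) loan_id, dt AS lastdt, dlq_stat
FROM act
ORDER BY loan_id, dt DESC;
```

The existing primary key on (loan_id, dt) can only deliver rows in ascending/ascending or descending/descending order, so it cannot supply the mixed order loan_id ASC, dt DESC without a sort; that is why a separate index is sketched here.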