Postgres - sequential scan used for a single aggregate, index scan for many aggregates

Time: 2016-11-10 15:43:30

Tags: performance postgresql indexing aggregate-functions

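I want to query a table and sum one column over all rows whose date is the last day of a month.

Let's use the following table as an example: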

CREATE TABLE example(dt date, value int, other1 int, other2 int, other3 int);
CREATE INDEX ON example (dt);

My query is as follows:

SELECT dt, SUM(value)
FROM example
WHERE dt in (select date_trunc('month', d) + interval '1 month - 1 day'
             from generate_series('2012-01-01'::date, '2016-11-10'::date, interval '1 month') dates(d))
GROUP BY dt
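
To make the filter easier to read, the subquery can be run on its own. This is a minimal sketch: the `::date` cast and the `month_end` alias are additions for illustration and are not part of the original query, whose uncast expression is timestamp-typed rather than a `date`.

```sql
-- One row per month from 2012-01 through 2016-11, each moved to the last day of its month.
-- date_trunc('month', d) gives the first of the month; adding '1 month - 1 day' lands on the last day.
SELECT (date_trunc('month', d) + interval '1 month - 1 day')::date AS month_end  -- cast and alias are illustrative
FROM generate_series('2012-01-01'::date, '2016-11-10'::date, interval '1 month') AS dates(d)
ORDER BY month_end;
```

This should return 59 rows, from 2012-01-31 to 2016-11-30, which matches the `rows=59` reported for the function scan in the plans.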

If I look at the query plan, I can see that it performs a sequential scan on the table:

EXPLAIN ANALYSE SELECT dt, SUM(value)
FROM example
WHERE dt in (select date_trunc('month', d) + interval '1 month - 1 day'
             from generate_series('2012-01-01'::date, '2016-11-10'::date, interval '1 month') dates(d))
GROUP BY dt

 GroupAggregate  (cost=825385.12..871490.30 rows=1536 width=12) (actual time=4323.887..6141.401 rows=56 loops=1)
   Group Key: example.dt
   ->  Merge Join  (cost=825385.12..863846.28 rows=1525732 width=12) (actual time=4323.811..6118.514 rows=101102 loops=1)
         Merge Cond: (example.dt = ((date_trunc('month'::text, dates.d) + '1 mon -1 days'::interval)))
         ->  Sort  (cost=825312.64..832941.30 rows=3051464 width=12) (actual time=4323.585..5303.902 rows=3051464 loops=1)
               Sort Key: example.dt
               Sort Method: external merge  Disk: 77512kB
               ->  Seq Scan on example  (cost=0.00..392353.64 rows=3051464 width=12) (actual time=10.385..1748.592 rows=3051464 loops=1)
         ->  Sort  (cost=72.48..72.98 rows=200 width=8) (actual time=0.168..18.248 rows=101105 loops=1)
               Sort Key: ((date_trunc('month'::text, dates.d) + '1 mon -1 days'::interval))
               Sort Method: quicksort  Memory: 27kB
               ->  Unique  (cost=59.84..64.84 rows=200 width=8) (actual time=0.108..0.143 rows=59 loops=1)
                     ->  Sort  (cost=59.84..62.34 rows=1000 width=8) (actual time=0.106..0.112 rows=59 loops=1)
                           Sort Key: ((date_trunc('month'::text, dates.d) + '1 mon -1 days'::interval))
                           Sort Method: quicksort  Memory: 27kB
                           ->  Function Scan on generate_series dates  (cost=0.01..10.01 rows=1000 width=8) (actual time=0.042..0.097 rows=59 loops=1)

However, if I add more SUMs to the query, the planner decides to use the index on dt:

EXPLAIN ANALYSE SELECT dt, SUM(value), SUM(other1), SUM(other2), SUM(other3)
FROM example
WHERE dt in (select date_trunc('month', d) + interval '1 month - 1 day'
             from generate_series('2012-01-01'::date, '2016-11-10'::date, interval '1 month') dates(d))
GROUP BY dt

HashAggregate  (cost=1005765.17..1005780.53 rows=1536 width=61) (actual time=225.249..225.276 rows=56 loops=1)
   Group Key: l.as_of
   ->  Nested Loop  (cost=60.27..975250.53 rows=1525732 width=61) (actual time=0.141..173.853 rows=101102 loops=1)
         ->  Unique  (cost=59.84..64.84 rows=200 width=8) (actual time=0.100..0.192 rows=59 loops=1)
               ->  Sort  (cost=59.84..62.34 rows=1000 width=8) (actual time=0.099..0.125 rows=59 loops=1)
                     Sort Key: ((date_trunc('month'::text, dates.d) + '1 mon -1 days'::interval))
                     Sort Method: quicksort  Memory: 27kB
                     ->  Function Scan on generate_series dates  (cost=0.01..10.01 rows=1000 width=8) (actual time=0.031..0.080 rows=59 loops=1)
         ->  Index Scan using dashboard_loanhistory_95daa586 on dashboard_loanhistory l  (cost=0.43..4856.06 rows=1987 width=61) (actual time=0.025..1.579 rows=1714 loops=59)
               Index Cond: (as_of = (date_trunc('month'::text, dates.d) + '1 mon -1 days'::interval))
 Planning time: 0.228 ms
 Execution time: 225.379 ms

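What is going on here? I would like the original query to run using the index on dt, and I would rather not add unnecessary aggregates to the query.

1 Answer:

Answer 0: (score: 0)

This is based on the comments on the question, in particular @joops's answer. It is a bit of a hack because it needs another index. I really don't understand why the query planner won't use the one on dt here, but this works ¯\_(ツ)_/¯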

I added a partial index on the dt column, restricted to the days that could be the last day of a month:

CREATE INDEX ON example (dt) WHERE date_part('day', dt) IN (28, 29, 30, 31);
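
The index predicate relies on every month-end date having a day of month between 28 and 31. A quick sanity check, sketched here with a hypothetical `month_ends_outside_predicate` alias, counts generated month-end dates that fall outside that range:

```sql
-- Counts month-end dates whose day of month is NOT one of 28..31.
-- Every calendar month ends on day 28, 29, 30 or 31, so this should return 0.
SELECT count(*) AS month_ends_outside_predicate  -- alias is illustrative
FROM generate_series('2012-01-01'::date, '2016-11-10'::date, interval '1 month') AS dates(d)
WHERE date_part('day', date_trunc('month', d) + interval '1 month - 1 day')
      NOT IN (28, 29, 30, 31);
```

A result of 0 means no month-end row can be missed by the partial index. The index still contains rows for days such as the 28th of a 31-day month, so it is smaller than a full index on dt but not limited to month ends.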

I then changed my query to include a predicate on the day of the month:

SELECT dt, SUM(value)
FROM example
WHERE date_part('day', dt) IN (28, 29, 30, 31) AND
      dt IN (select date_trunc('month', d) + interval '1 month - 1 day'
             from generate_series('2012-01-01'::date, '2016-11-10'::date, interval '1 month') dates(d))
GROUP BY dt;
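
The extra `date_part` condition matters because PostgreSQL only considers a partial index when it can prove that the query's WHERE clause implies the index predicate. Repeating the predicate verbatim is the simplest way to guarantee that. To confirm which plan is chosen, the rewritten query can be run under EXPLAIN, as in this sketch (the `BUFFERS` option is an addition here, not something the original post used):

```sql
-- Shows the chosen plan together with actual timings and buffer usage.
-- Look for an index or bitmap scan on the partial index instead of "Seq Scan on example".
EXPLAIN (ANALYZE, BUFFERS)
SELECT dt, SUM(value)
FROM example
WHERE date_part('day', dt) IN (28, 29, 30, 31)
  AND dt IN (SELECT date_trunc('month', d) + interval '1 month - 1 day'
             FROM generate_series('2012-01-01'::date, '2016-11-10'::date, interval '1 month') AS dates(d))
GROUP BY dt;
```

When reading the output, it is worth comparing estimated and actual row counts. In the plans shown in the question, the planner expected 1,525,732 joined rows where 101,102 were actually returned, and assumed 1000 rows from `generate_series` where 59 came back. That overestimate may be part of why the sequential scan looked cheaper for the original query, though this is an interpretation and not something the post confirms.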