I have a table with the following format:
I have an index on this table, defined as follows:
CREATE INDEX time_index ON events_table USING btree (flagged_as_bot, flagged_as_spam, category, "timestamp") WHERE ((flagged_as_bot = false) AND (flagged_as_spam = false))
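I query this table to build a line chart of the number of events per day. However, I want the chart adjusted to the user's time zone. Currently, I have a query like this: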
SELECT
date_trunc('day', timestamp + INTERVAL '-5 hour') AS ts,
category,
COUNT(*) AS count
FROM
events_table
WHERE
category = 'the category'
AND flagged_as_bot = FALSE
AND flagged_as_spam = FALSE
AND timestamp >= '2018-05-04T00:00:00'::timestamp
AND timestamp < '2018-10-31T17:57:59.661664'::timestamp
GROUP BY
ts,
category
ORDER BY
1 ASC
In most cases, for categories with fewer than 100,000 records, this is reasonably fast:
GroupAggregate (cost=8908.56..8958.18 rows=1985 width=70) (actual time=752.886..753.301 rows=124 loops=1)
Group Key: (date_trunc('day'::text, ("timestamp" + '-05:00:00'::interval))), category
-> Sort (cost=8908.56..8913.52 rows=1985 width=62) (actual time=752.878..752.983 rows=797 loops=1)
Sort Key: (date_trunc('day'::text, ("timestamp" + '-05:00:00'::interval)))
Sort Method: quicksort Memory: 137kB
-> Bitmap Heap Scan on listens (cost=552.79..8799.83 rows=1985 width=62) (actual time=748.683..752.568 rows=797 loops=1)
Recheck Cond: ((category = '7248c3b8-727e-4357-a267-e9b0e3e36d4b'::bpchar) AND ("timestamp" >= '2018-05-04 00:00:00'::timestamp without time zone) AND ("timestamp" < '2018-10-31 17:57:59.661664'::timestamp without time zone))
Filter: ((NOT flagged_as_bot) AND (NOT flagged_as_spam))
Rows Removed by Filter: 1576
Heap Blocks: exact=1906
-> Bitmap Index Scan on time_index (cost=0.00..552.30 rows=2150 width=0) (actual time=748.324..748.324 rows=2373 loops=1)
Index Cond: ((category = '7248c3b8-727e-4357-a267-e9b0e3e36d4b'::bpchar) AND ("timestamp" >= '2018-05-04 00:00:00'::timestamp without time zone) AND ("timestamp" < '2018-10-31 17:57:59.661664'::timestamp without time zone))
Planning time: 0.628 ms
Execution time: 753.362 ms
For categories with a large number of records (> 100,000), the index is not used and the query is very slow:
GroupAggregate (cost=1232229.95..1287491.60 rows=2126204 width=70) (actual time=14649.671..17178.955 rows=181 loops=1)
Group Key: (date_trunc('day'::text, ("timestamp" + '-05:00:00'::interval))), category
-> Sort (cost=1232229.95..1238072.10 rows=2336859 width=62) (actual time=14643.887..16031.031 rows=3070695 loops=1)
Sort Key: (date_trunc('day'::text, ("timestamp" + '-05:00:00'::interval)))
Sort Method: external merge Disk: 216200kB
-> Seq Scan on listens (cost=0.00..809314.38 rows=2336859 width=62) (actual time=0.015..9572.722 rows=3070695 loops=1)
Filter: ((NOT flagged_as_bot) AND (NOT flagged_as_spam) AND ("timestamp" >= '2018-05-04 00:00:00'::timestamp without time zone) AND ("timestamp" < '2018-10-31 17:57:59.661664'::timestamp without time zone) AND (category = '3b634b32-bb82-4f56-ada4-f4b7bc4288a5'::bpchar))
Rows Removed by Filter: 8788028
Planning time: 0.239 ms
Execution time: 17228.314 ms
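My assumption is that the index is not used because the planner estimates that using it would cost far more than simply scanning the table. I also suspect this is related to the use of date_trunc to compute the date I group by.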
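One way I could test this assumption is to discourage sequential scans for a single transaction and compare the resulting plan against the one above. The following is only a diagnostic sketch, not a fix: it reuses the query from this question, and enable_seqscan is a standard PostgreSQL planner setting. For context, the slow plan reports 3,070,695 rows kept and 8,788,028 rows removed by the filter, so roughly a quarter of the table matches, which by itself may be enough for the planner to prefer a sequential scan.

```sql
-- Diagnostic sketch only: SET LOCAL applies just to this transaction,
-- and ROLLBACK discards it, so nothing is changed permanently.
BEGIN;

-- Make sequential scans look very expensive to the planner, so that
-- it shows the best index-based plan it can find.
SET LOCAL enable_seqscan = off;

-- Same query as above, using the category from the slow plan.
EXPLAIN (ANALYZE, BUFFERS)
SELECT
    date_trunc('day', timestamp + INTERVAL '-5 hour') AS ts,
    category,
    COUNT(*) AS count
FROM
    events_table
WHERE
    category = '3b634b32-bb82-4f56-ada4-f4b7bc4288a5'
    AND flagged_as_bot = FALSE
    AND flagged_as_spam = FALSE
    AND timestamp >= '2018-05-04T00:00:00'::timestamp
    AND timestamp < '2018-10-31T17:57:59.661664'::timestamp
GROUP BY
    ts,
    category
ORDER BY
    1 ASC;

ROLLBACK;
```

If the forced index plan turns out slower than the sequential scan, then the planner's choice was reasonable and the time is going elsewhere, for example into the sort, which spills about 216 MB to disk in the slow plan.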
I have already thought about what could be done here. These are some of my ideas:
Before I go down the rabbit hole, I thought I would ask here and see whether anyone has any ideas.