Efficiently counting rows by date, adjusted for time zone

Time: 2018-10-31 13:37:10

Tags: postgresql postgresql-9.3 postgresql-performance

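I have a table with the following schema:

  • id (uuid; pk)
  • timestamp (timestamp)
  • category (bpchar)
  • flagged_as_spam (bool)
  • flagged_as_bot (bool)
  • ... (other metadata)

I have an index on this table that looks like this: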

CREATE INDEX time_index ON events_table USING btree (flagged_as_bot, flagged_as_spam, category, "timestamp") WHERE ((flagged_as_bot = false) AND (flagged_as_spam = false))

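I query this table to generate a line chart showing the number of events that occurred on each day. However, I want the chart to be adjusted to the user's time zone. Currently, I have a query like this: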

SELECT
    date_trunc('day', timestamp + INTERVAL '-5 hour') AS ts,
    category,
    COUNT(*) AS count
FROM
    events_table
WHERE
    category = 'the category'
    AND flagged_as_bot = FALSE
    AND flagged_as_spam = FALSE
    AND timestamp >= '2018-05-04T00:00:00'::timestamp
    AND timestamp < '2018-10-31T17:57:59.661664'::timestamp
GROUP BY
    ts,
    category
ORDER BY
    1 ASC

For the most part, for categories with fewer than 100,000 records, this is fairly fast:

GroupAggregate  (cost=8908.56..8958.18 rows=1985 width=70) (actual time=752.886..753.301 rows=124 loops=1)
  Group Key: (date_trunc('day'::text, ("timestamp" + '-05:00:00'::interval))), category
  ->  Sort  (cost=8908.56..8913.52 rows=1985 width=62) (actual time=752.878..752.983 rows=797 loops=1)
        Sort Key: (date_trunc('day'::text, ("timestamp" + '-05:00:00'::interval)))
        Sort Method: quicksort  Memory: 137kB
        ->  Bitmap Heap Scan on listens  (cost=552.79..8799.83 rows=1985 width=62) (actual time=748.683..752.568 rows=797 loops=1)
              Recheck Cond: ((category = '7248c3b8-727e-4357-a267-e9b0e3e36d4b'::bpchar) AND ("timestamp" >= '2018-05-04 00:00:00'::timestamp without time zone) AND ("timestamp" < '2018-10-31 17:57:59.661664'::timestamp without time zone))
              Filter: ((NOT flagged_as_bot) AND (NOT flagged_as_spam))
              Rows Removed by Filter: 1576
              Heap Blocks: exact=1906
              ->  Bitmap Index Scan on time_index  (cost=0.00..552.30 rows=2150 width=0) (actual time=748.324..748.324 rows=2373 loops=1)
                    Index Cond: ((category = '7248c3b8-727e-4357-a267-e9b0e3e36d4b'::bpchar) AND ("timestamp" >= '2018-05-04 00:00:00'::timestamp without time zone) AND ("timestamp" < '2018-10-31 17:57:59.661664'::timestamp without time zone))
Planning time: 0.628 ms
Execution time: 753.362 ms

For categories with a large number of records (> 100,000), the index is not used and the query is very slow:

GroupAggregate  (cost=1232229.95..1287491.60 rows=2126204 width=70) (actual time=14649.671..17178.955 rows=181 loops=1)
  Group Key: (date_trunc('day'::text, ("timestamp" + '-05:00:00'::interval))), category
  ->  Sort  (cost=1232229.95..1238072.10 rows=2336859 width=62) (actual time=14643.887..16031.031 rows=3070695 loops=1)
        Sort Key: (date_trunc('day'::text, ("timestamp" + '-05:00:00'::interval)))
        Sort Method: external merge  Disk: 216200kB
        ->  Seq Scan on listens  (cost=0.00..809314.38 rows=2336859 width=62) (actual time=0.015..9572.722 rows=3070695 loops=1)
              Filter: ((NOT flagged_as_bot) AND (NOT flagged_as_spam) AND ("timestamp" >= '2018-05-04 00:00:00'::timestamp without time zone) AND ("timestamp" < '2018-10-31 17:57:59.661664'::timestamp without time zone) AND (category = '3b634b32-bb82-4f56-ada4-f4b7bc4288a5'::bpchar))
              Rows Removed by Filter: 8788028
Planning time: 0.239 ms
Execution time: 17228.314 ms

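My assumption is that the index is not used because the overhead of using it is far higher than that of simply scanning the table. And I imagine this is because date_trunc is used to compute the date to group by.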
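One way to check this assumption, as a rough sketch: run the same filter with the date_trunc expression and the GROUP BY removed, then compare the plan with the ones above. If the planner still picks a sequential scan for a large category, that would suggest the scan choice comes from how many rows the filter matches rather than from the grouping expression. The category value is the same placeholder used in the query above.

```sql
-- Same WHERE clause as the original query, without date_trunc or GROUP BY.
-- Compare the chosen scan node (Seq Scan vs. Bitmap/Index Scan) with the
-- plans above; only the aggregation step differs from the original query.
EXPLAIN (ANALYZE, BUFFERS)
SELECT
    COUNT(*)
FROM
    events_table
WHERE
    category = 'the category'
    AND flagged_as_bot = FALSE
    AND flagged_as_spam = FALSE
    AND "timestamp" >= '2018-05-04T00:00:00'::timestamp
    AND "timestamp" < '2018-10-31T17:57:59.661664'::timestamp;
```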

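I have given some thought to what could be done here. These are the ideas I have come up with:

  • Most simply, I could create an expression index for each time zone offset I care about (typically GMT / EST / CST / MST / PST). This would take up a lot of space and each index would be used infrequently, but in theory it would allow index-only scans.
  • I could create an expression index truncated to the hour. I do not know whether that would help Postgres optimize the query.
  • I could pre-compute each date range and use some subquery magic to query the event count for each range. I also do not know whether this would bring any improvement.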
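The three ideas above could look roughly like the following. This is an untested sketch, not a measured solution: the index names are made up for illustration, the -5 hour offset is the one from the query above, and the expressions assume "timestamp" is a timestamp without time zone (as the plans show), which is what makes them usable in an index. Whether the planner would actually choose any of these is exactly what I do not know.

```sql
-- Idea 1: one expression index per offset (hypothetical name; repeat per offset).
CREATE INDEX events_day_minus5_idx ON events_table
    USING btree (category, date_trunc('day', "timestamp" + INTERVAL '-5 hour'))
    WHERE flagged_as_bot = false AND flagged_as_spam = false;

-- Idea 2: a single index truncated to the hour (hypothetical name), with the
-- query counting per hour first and then rolling hours up into shifted days.
-- This only gives the same result for offsets that are whole hours.
CREATE INDEX events_hour_idx ON events_table
    USING btree (category, date_trunc('hour', "timestamp"))
    WHERE flagged_as_bot = false AND flagged_as_spam = false;

SELECT
    date_trunc('day', hourly.hour_ts + INTERVAL '-5 hour') AS ts,
    hourly.category,
    SUM(hourly.hourly_count) AS count
FROM (
    SELECT
        date_trunc('hour', "timestamp") AS hour_ts,
        category,
        COUNT(*) AS hourly_count
    FROM events_table
    WHERE category = 'the category'
      AND flagged_as_bot = FALSE
      AND flagged_as_spam = FALSE
      AND "timestamp" >= '2018-05-04T00:00:00'::timestamp
      AND "timestamp" < '2018-10-31T17:57:59.661664'::timestamp
    GROUP BY 1, 2
) AS hourly
GROUP BY 1, 2
ORDER BY 1 ASC;

-- Idea 3: pre-compute the local days and count each range with a subquery.
-- A local day D at offset -5 covers stored timestamps in [D + 5h, D + 1 day + 5h).
-- Unlike the original query, this also returns days with a count of zero.
SELECT
    d.local_day AS ts,
    (SELECT COUNT(*)
     FROM events_table e
     WHERE e.category = 'the category'
       AND e.flagged_as_bot = FALSE
       AND e.flagged_as_spam = FALSE
       AND e."timestamp" >= d.local_day + INTERVAL '5 hour'
       AND e."timestamp" < d.local_day + INTERVAL '1 day' + INTERVAL '5 hour'
       -- keep the overall range of the original query for the edge days
       AND e."timestamp" >= '2018-05-04T00:00:00'::timestamp
       AND e."timestamp" < '2018-10-31T17:57:59.661664'::timestamp) AS count
FROM generate_series('2018-05-03'::timestamp,
                     '2018-10-31'::timestamp,
                     INTERVAL '1 day') AS d(local_day)
ORDER BY 1 ASC;
```

In the third variant each range condition is a plain comparison on "timestamp", so the existing time_index could in principle serve every subquery without any new index.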

Before I go down the rabbit hole, I figured I would reach out and see whether anyone has any insights.

0 answers:

No answers