How to improve the performance of a query that counts records in specific time intervals?

Date: 2016-08-08 11:33:08

Tags: sql performance postgresql

I am using PostgreSQL 9.5. Below you can find my table structure, my query, and its result. I would like to improve the performance of the query. The query counts records in specific time intervals, for example: 250 milliseconds, 1 second, 22 minutes, 2 days and 30 minutes, etc.

For larger intervals such as 60 minutes the query is fast, but for small intervals such as 4 seconds it is very slow.

The most important things:

  • I work with a large database (20 million rows or even more, but in a query I only use a part of it via the WHERE clause, e.g. 1 million rows or more).
  • The WHERE clause always contains the id_user_table and sip columns. In some cases the WHERE clause could contain all columns of the table; it depends on the user's choice.
  • Currently I have created a B-Tree index on the starttime column:

    CREATE INDEX starttime_interval ON data_store (starttime);
    

Do you know of some ways to improve the performance of this query?

For example, by:

  • creating some indexes on columns (which indexes? and how do I create them?),
  • improving my query,
  • changing some settings in PostgreSQL,
  • or something else.

Here is the structure of my table:

  column_name  |   udt_name  | length | is_nullable |  key
---------------+-------------+--------+-------------+--------
id             |    int8     |        |     NO      |   PK
id_user_table  |    int4     |        |     NO      |   FK
starttime      | timestamptz |        |     NO      |
time           |   float8    |        |     NO      |
sip            |   varchar   |  100   |     NO      |
dip            |   varchar   |  100   |     NO      |
sport          |    int4     |        |     YES     |
dport          |    int4     |        |     YES     |
proto          |   varchar   |   50   |     NO      |
totbytes       |    int8     |        |     YES     |
info           |    text     |        |     YES     |
label          |   varchar   |   10   |     NO      |

A simple SELECT * FROM data_store WHERE id_user_table=1 and sip='147.32.84.138' ORDER BY starttime returns this:

  id | id_user_table |          starttime         |      sip      |  other columns...
-----+---------------+----------------------------+---------------+--------------------
 185 |       1       | 2011-09-12 15:24:03.248+02 | 147.32.84.138 |        ...
 189 |       1       | 2011-09-12 15:24:03.256+02 | 147.32.84.138 |        ...
 312 |       1       | 2011-09-12 15:24:06.112+02 | 147.32.84.138 |        ...
 313 |       1       | 2011-09-12 15:24:06.119+02 | 147.32.84.138 |        ... 
 450 |       1       | 2011-09-12 15:24:09.196+02 | 147.32.84.138 |        ...
 451 |       1       | 2011-09-12 15:24:09.203+02 | 147.32.84.138 |        ... 
 452 |       1       | 2011-09-12 15:24:09.21+02  | 147.32.84.138 |        ...

Here is my query for a 4-second time interval:

WITH generate_period AS(

    SELECT generate_series(date_trunc('second',min(starttime)), 
                           date_trunc('second',max(starttime)), 
                           interval '4 second') as tp
    FROM data_store 
    WHERE id_user_table=1 and sip='147.32.84.138' --other restrictions

), data_series AS(

    SELECT date_trunc('second', starttime) AS starttime, count(*) AS ct
    FROM data_store  
    WHERE id_user_table=1 and sip='147.32.84.138' --other restrictions
    GROUP  BY 1

)

SELECT gp.tp AS "starttime-from", 
       gp.tp + interval '4 second' AS "starttime-to", 
       COALESCE(sum(ds.ct),0) AS ct
FROM  generate_period gp
LEFT JOIN data_series ds ON date_trunc('second',ds.starttime) >= gp.tp 
                        and date_trunc('second',ds.starttime) < gp.tp + interval '4 second'
GROUP BY 1
ORDER BY 1;

Here is the result of the query:

      starttime-from    |      starttime-to      |   ct
------------------------+------------------------+---------
 2011-09-12 15:24:03+02 | 2011-09-12 15:24:07+02 |    4
 2011-09-12 15:24:07+02 | 2011-09-12 15:24:11+02 |    3
 2011-09-12 15:24:11+02 | 2011-09-12 15:24:15+02 |    0
           ...          |           ...          |   ...

Here is the EXPLAIN ANALYZE output I got in pgAdmin for the 4-second interval:

Sort  (cost=7477837.88..7477838.38 rows=200 width=16) (actual time=1537280.238..1537289.519 rows=60141 loops=1)
  Sort Key: gp.tp
  Sort Method: external merge  Disk: 1792kB
  CTE generate_period
    ->  Aggregate  (cost=166919.73..166924.74 rows=1000 width=8) (actual time=752.301..823.022 rows=60141 loops=1)
          ->  Seq Scan on data_store  (cost=0.00..163427.57 rows=698431 width=8) (actual time=0.034..703.845 rows=679951 loops=1)
                Filter: ((id_user_table = 1) AND ((sip)::text = '147.32.84.138'::text))
                Rows Removed by Filter: 4030687
  CTE data_series
    ->  GroupAggregate  (cost=242521.00..250085.18 rows=186076 width=8) (actual time=1233.414..1341.701 rows=57555 loops=1)
          Group Key: (date_trunc('second'::text, data_store_1.starttime))
          ->  Sort  (cost=242521.00..244267.08 rows=698431 width=8) (actual time=1233.407..1284.110 rows=679951 loops=1)
                Sort Key: (date_trunc('second'::text, data_store_1.starttime))
                Sort Method: external sort  Disk: 11960kB
                ->  Seq Scan on data_store data_store_1  (cost=0.00..165173.65 rows=698431 width=8) (actual time=0.043..886.224 rows=679951 loops=1)
                      Filter: ((id_user_table = 1) AND ((sip)::text = '147.32.84.138'::text))
                      Rows Removed by Filter: 4030687
  ->  HashAggregate  (cost=7060817.31..7060820.31 rows=200 width=16) (actual time=1537215.586..1537240.698 rows=60141 loops=1)
        Group Key: gp.tp
        ->  Nested Loop Left Join  (cost=0.00..6957441.76 rows=20675111 width=16) (actual time=1985.731..1536921.862 rows=74443 loops=1)
              Join Filter: ((date_trunc('second'::text, ds.starttime) >= gp.tp) AND (date_trunc('second'::text, ds.starttime) < (gp.tp + '00:00:04'::interval)))
              Rows Removed by Join Filter: 3461357700
              ->  CTE Scan on generate_period gp  (cost=0.00..20.00 rows=1000 width=8) (actual time=752.303..910.810 rows=60141 loops=1)
              ->  CTE Scan on data_series ds  (cost=0.00..3721.52 rows=186076 width=16) (actual time=0.021..3.716 rows=57555 loops=60141)
Planning time: 0.258 ms
Execution time: 1537389.102 ms

UPDATE

Here is another query, but without the WITH CTEs and the date_trunc() expressions, so this query may be easier to optimize:

SELECT gp.tp AS starttime_from, 
       gp.tp + interval '4 second' AS starttime_to, 
       count(ds.id)
FROM (SELECT generate_series(min(starttime), max(starttime), interval '4 second') as tp
      FROM data_store
      WHERE id_user_table=1 and sip='147.32.84.138' --other restrictions
     ) gp
     LEFT JOIN data_store ds 
     ON ds.starttime >= gp.tp and ds.starttime < gp.tp + interval '4 second'
        and id_user_table=1 and sip='147.32.84.138' --other restrictions
group by gp.tp
order by gp.tp;

The query above is much faster than the first one. Currently the B-Tree index on the starttime column works, but it is still not enough. If I set a 100 milliseconds time interval, I still have to wait too long. The 100 milliseconds range is the smallest interval a user can set. I have just added a B-Tree index on the sip column, but it did not help.
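For reference, the 100 ms variant of the same query changes only the generate_series step and the bucket width (the date_trunc() on the series bounds matches the plan output below):

```sql
SELECT gp.tp AS starttime_from,
       gp.tp + interval '100 milliseconds' AS starttime_to,
       count(ds.id)
FROM (SELECT generate_series(date_trunc('second', min(starttime)),
                             date_trunc('second', max(starttime)),
                             interval '100 milliseconds') AS tp
      FROM data_store
      WHERE id_user_table=1 and sip='147.32.84.138' --other restrictions
     ) gp
     LEFT JOIN data_store ds
     ON ds.starttime >= gp.tp and ds.starttime < gp.tp + interval '100 milliseconds'
        and id_user_table=1 and sip='147.32.84.138' --other restrictions
group by gp.tp
order by gp.tp;
```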

Here is the EXPLAIN ANALYZE output I got in pgAdmin for the 100 ms interval:

Sort  (cost=14672356.96..14672357.46 rows=200 width=16) (actual time=9380.768..9951.074 rows=2405621 loops=1)
  Sort Key: (generate_series(date_trunc('second'::text, $0), date_trunc('second'::text, $1), '00:00:00.1'::interval))
  Sort Method: external merge  Disk: 79880kB
  ->  HashAggregate  (cost=14672346.81..14672349.31 rows=200 width=16) (actual time=6199.538..7232.962 rows=2405621 loops=1)
        Group Key: (generate_series(date_trunc('second'::text, $0), date_trunc('second'::text, $1), '00:00:00.1'::interval))
        ->  Nested Loop Left Join  (cost=2.02..14284329.59 rows=77603444 width=16) (actual time=0.321..4764.648 rows=3006226 loops=1)
              ->  Result  (cost=1.58..6.59 rows=1000 width=0) (actual time=0.295..159.147 rows=2405621 loops=1)
                    InitPlan 1 (returns $0)
                      ->  Limit  (cost=0.43..0.79 rows=1 width=8) (actual time=0.208..0.208 rows=1 loops=1)
                            ->  Index Scan using starttime_interval on data_store  (cost=0.43..250437.98 rows=698431 width=8) (actual time=0.204..0.204 rows=1 loops=1)
                                  Index Cond: (starttime IS NOT NULL)
                                  Filter: ((id_user_table = 1) AND ((sip)::text = '147.32.84.138'::text))
                                  Rows Removed by Filter: 144
                    InitPlan 2 (returns $1)
                      ->  Limit  (cost=0.43..0.79 rows=1 width=8) (actual time=0.050..0.050 rows=1 loops=1)
                            ->  Index Scan Backward using starttime_interval on data_store data_store_1  (cost=0.43..250437.98 rows=698431 width=8) (actual time=0.049..0.049 rows=1 loops=1)
                                  Index Cond: (starttime IS NOT NULL)
                                  Filter: ((id_user_table = 1) AND ((sip)::text = '147.32.84.138'::text))
                                  Rows Removed by Filter: 23
              ->  Index Scan using starttime_interval on data_store ds  (cost=0.44..13508.28 rows=77603 width=16) (actual time=0.002..0.002 rows=0 loops=2405621)
                    Index Cond: ((starttime >= (generate_series(date_trunc('second'::text, $0), date_trunc('second'::text, $1), '00:00:00.1'::interval))) AND (starttime < ((generate_series(date_trunc('second'::text, $0), date_trunc('second'::text, $1), '00 (...)
                    Filter: ((id_user_table = 1) AND ((sip)::text = '147.32.84.138'::text))
                    Rows Removed by Filter: 2
Planning time: 1.299 ms
Execution time: 11641.154 ms

2 answers:

Answer 0 (score: 0)

As I wrote in the comments, you could use a multicolumn index:

CREATE INDEX my_index ON data_store (id_user_table, sip, starttime);

This should remove the Filter: ((id_user_table = 1) AND ((sip)::text = '147.32.84.138'::text)) step from the execution plan (and because such a filter is executed in every loop, the savings can be very high).
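To check that the filter is gone, re-run the plan after creating the index (a sketch; the exact plan shape depends on your data and settings):

```sql
CREATE INDEX my_index ON data_store (id_user_table, sip, starttime);

-- The Filter: line on id_user_table/sip should disappear from the scan,
-- replaced by an Index Cond: that uses the leading columns of the new index.
EXPLAIN ANALYZE
SELECT date_trunc('second', starttime), count(*)
FROM data_store
WHERE id_user_table = 1 AND sip = '147.32.84.138'
GROUP BY 1;
```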

I have also prepared an alternative query:

select
    min + (max - min) * (least - 1) as starttime_from,
    min + (max - min) * least as starttime_to,
    count
from (
    select
        min,
        max,
        count(1),
        least(
            width_bucket(
                extract(epoch from starttime)::double precision,
                extract(epoch from min)::double precision,
                extract(epoch from max)::double precision,
                ceil(extract(epoch from (max - min))/extract(epoch from query_interval))::integer
            ),
            ceil(extract(epoch from (max - min))/extract(epoch from query_interval))::integer
        )
    from (
        select
            *,
            max(starttime) over (),
            min(starttime) over (),
            '4 second'::interval as query_interval
        from data_store
    ) as subquery2
    group by least, min, max
) as subquery1;

It should avoid the nested loop, and I think it may be much faster. However, it may need some tweaking to fit the result you want (some date truncation?).
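As a sanity check on the bucketing logic: width_bucket(operand, low, high, nbuckets) assigns operand to one of nbuckets equal-width buckets over [low, high); values at or above high overflow into bucket nbuckets + 1, which is why the query above caps the result with least(..., nbuckets). A minimal example (the classic one from the PostgreSQL documentation):

```sql
-- 5 equal-width buckets over [0.024, 10.06); 5.35 lands in bucket 3.
SELECT width_bucket(5.35, 0.024, 10.06, 5);   -- → 3
-- The upper bound is exclusive, so a value equal to it overflows to bucket 6.
SELECT width_bucket(10.06, 0.024, 10.06, 5);  -- → 6
```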

Answer 1 (score: 0)

Based on the comments of @pozs and @RadekPostołowicz, the final query looks like this (for a 4-second time interval):

SELECT gp.tp AS starttime_from, gp.tp + interval '4 second' AS starttime_to, count(ds.id)
FROM (SELECT generate_series(min(starttime),max(starttime), interval '4 second') as tp
      FROM data_store
      WHERE id_user_table=1 and sip='147.32.84.138'
      ORDER BY 1
     ) gp 
     LEFT JOIN data_store ds 
     ON ds.id_user_table=1 and ds.sip='147.32.84.138' 
        and ds.starttime >= gp.tp and ds.starttime < gp.tp + interval '4 second'
GROUP BY starttime_from

As @pozs noticed, for very small time intervals the query result contains many rows with a zero count, and these rows consume space. In such a case the query should include a HAVING count(ds.id) > 0 restriction, but then you have to handle those missing zero rows on the client side. Here is the second version of the query, with the HAVING restriction:

SELECT gp.tp AS starttime_from, gp.tp + interval '4 second' AS starttime_to, count(ds.id)
FROM (SELECT generate_series(min(starttime),max(starttime), interval '4 second') as tp
      FROM data_store
      WHERE id_user_table=1 and sip='147.32.84.138'
      ORDER BY 1
     ) gp 
     LEFT JOIN data_store ds 
     ON ds.id_user_table=1 and ds.sip='147.32.84.138' 
        and ds.starttime >= gp.tp and ds.starttime < gp.tp + interval '4 second'
GROUP BY starttime_from
HAVING count(ds.id) > 0

The most important thing is to create the multicolumn index that @RadekPostołowicz proposed in his comment/answer:

CREATE INDEX my_index ON data_store (id_user_table, sip, starttime);

Why these columns? Because in every query I always use the id_user_table, sip and starttime columns in the WHERE clause.
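As a side note (an assumption on my part, not something proposed in the answers above): if one particular user/sip pair is queried very often, a partial index on starttime restricted to that pair would be smaller than the full multicolumn index. It only helps for the hard-coded pair, so it is unsuitable when the filtered columns vary with the user's choice:

```sql
-- Hypothetical partial index: useful only for this exact user/sip pair.
CREATE INDEX data_store_user1_sip_starttime_idx
    ON data_store (starttime)
    WHERE id_user_table = 1 AND sip = '147.32.84.138';
```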