PostgreSQL - get the count of latest and specific records

Date: 2016-08-22 11:08:00

Tags: sql postgresql count timestamp greatest-n-per-group

I have a table:

    select sid, type, status, timestamp from contact_history limit 10;
           sid   | type | status |           timestamp
        ---------+------+--------+-------------------------------
         6291179 |    0 |   1025 | 2015-08-24 13:05:22.501025+02
         68737   |    0 |      5 | 2015-08-24 13:05:32.500005+02
         4987391 |    0 |     65 | 2015-08-24 13:05:35.500065+02
         1189551 |    1 |     65 | 2015-08-24 13:06:05.510065+02
         3374714 |    1 |      5 | 2015-07-27 13:25:25.510005+02
         2297221 |    0 |      5 | 2015-07-27 13:25:48.500005+02
         5503230 |    2 |     65 | 2015-07-27 13:25:50.520065+02
         596992  |    1 |     65 | 2015-07-27 13:26:51.510065+02
         5215455 |    0 |   1025 | 2015-07-27 13:27:21.501025+02
         3011248 |    0 |      5 | 2015-07-27 13:27:46.500005+02
        (10 rows)


    \d contact_history
                                      Table "contact_history"
        Column     |           Type           |                          Modifiers
    ---------------+--------------------------+----------------------------------------------------------
     sid           | character varying(32)    | not null
     type          | integer                  | not null
     status        | integer                  | not null
     timestamp     | timestamp with time zone | not null
     id            | bigint                   | not null default nextval('contact_history_id_seq'::regcla
    Indexes:
        "contact_history_pk" PRIMARY KEY, btree (id)
        "contact_history_sid_timestamp_idx" btree (sid, "timestamp")

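A row is recorded each time a sid reaches some status for a given type, together with the timestamp at which that happened. There are no unique rows. Each sid can have any type and status at any time. There are about 20 million rows. The PostgreSQL version is 9.3.13.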

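Now I want to know how many sids have (type='0' or type='1') and status='5' at their max(timestamp). In other words, for each sid find the last timestamp and the corresponding type and status, then count those that satisfy the condition (type='0' or type='1') and status='5'. So I expect a single number as output. Other, more efficient ways to get the same result are also welcome. Thanks.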
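
To make the expected result concrete, here is a minimal sketch on made-up data. The sample CTE, its rows, and the column name ts are hypothetical and only stand in for contact_history; row_number() is used here purely to pick the latest row per sid for the illustration.

```sql
-- Hypothetical rows standing in for contact_history (not real data).
WITH sample(sid, type, status, ts) AS (
    VALUES
        ('A'::text, 0, 5,  timestamptz '2015-08-24 13:00:00+02'),
        ('A',       2, 65, '2015-08-24 13:05:00+02'),
        ('B',       1, 65, '2015-08-24 13:00:00+02'),
        ('B',       1, 5,  '2015-08-24 13:05:00+02'),
        ('C',       0, 5,  '2015-08-24 13:00:00+02')
),
ranked AS (
    -- rn = 1 marks the most recent row of each sid.
    SELECT sid, type, status,
           row_number() OVER (PARTITION BY sid ORDER BY ts DESC) AS rn
    FROM sample
)
SELECT count(*)
FROM ranked
WHERE rn = 1
  AND type IN (0, 1)
  AND status = 5;
-- A: latest row is (type 2, status 65) -> not counted
-- B: latest row is (type 1, status 5)  -> counted
-- C: latest row is (type 0, status 5)  -> counted
-- Result: 2
```

Note that sid A is not counted even though it has an older row with type 0 and status 5; only the most recent row per sid matters.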

1 Answer:

Answer 0: (score: 0)

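Thanks to a_horse_with_no_name, I followed the greatest-n-per-group path. Unfortunately, it is a bit different. I did some monkey engineering, and so far I have the following queries, with different costs: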

EXPLAIN SELECT count(*) FROM contact_history t1 LEFT OUTER JOIN contact_history t2 ON (t1.sid = t2.sid AND t1.timestamp < t2.timestamp) WHERE t2.sid IS NULL and (t1.type=0 OR t1.type=1) and t1.status = '5';
                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Aggregate  (cost=158816.96..158816.97 rows=1 width=0)
   ->  Hash Anti Join  (cost=66228.91..158003.37 rows=325435 width=0)
         Hash Cond: ((t1.sid)::text = (t2.sid)::text)
         Join Filter: (t1."timestamp" < t2."timestamp")
         ->  Seq Scan on contact_history t1  (cost=0.00..50771.93 rows=488152 width=15)
               Filter: ((status = 5) AND ((type = 0) OR (type = 1)))
         ->  Hash  (cost=39041.96..39041.96 rows=1563996 width=15)
               ->  Seq Scan on contact_history t2  (cost=0.00..39041.96 rows=1563996 width=15)
(8 rows)
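
One thing that may be worth checking in the anti-join form above: if a sid has two rows with exactly the same latest timestamp, neither row eliminates the other under t1.timestamp < t2.timestamp, so both can end up in the count. A possible variant, sketched below, breaks such ties with the id column. It assumes that, among rows with equal timestamps, the higher id is the later one; that is an assumption, not something the table definition guarantees.

```sql
-- Sketch: NOT EXISTS form of the anti-join, with id as a tie-breaker.
-- Assumption: for equal timestamps, a larger id means a later row.
SELECT count(*)
FROM contact_history t1
WHERE t1.type IN (0, 1)
  AND t1.status = 5
  AND NOT EXISTS (
      SELECT 1
      FROM contact_history t2
      WHERE t2.sid = t1.sid
        -- row-wise comparison: later timestamp, or same timestamp and larger id
        AND (t2."timestamp", t2.id) > (t1."timestamp", t1.id)
  );
```

Because id is the primary key, the pair ("timestamp", id) is unique, so at most one row per sid can survive the NOT EXISTS test. Whether this plans as well as the original anti-join would have to be checked with EXPLAIN on the real data.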

EXPLAIN SELECT count(*) from contact_history as ch, (select sid, max(timestamp) as max_t from contact_history group by sid) as sub where ch.sid=sub.sid and ch.timestamp=sub.max_t and (type='0' or type='1') and status = '5';
                                             QUERY PLAN
----------------------------------------------------------------------------------------------------
 Aggregate  (cost=393277.11..393277.12 rows=1 width=0)
   ->  Merge Join  (cost=366994.07..393277.10 rows=2 width=0)
         Merge Cond: ((contact_history.sid)::text = (ch.sid)::text)
         Join Filter: (ch."timestamp" = (max(contact_history."timestamp")))
         ->  GroupAggregate  (cost=253411.17..267270.04 rows=212890 width=15)
               ->  Sort  (cost=253411.17..257321.16 rows=1563996 width=15)
                     Sort Key: contact_history.sid
                     ->  Seq Scan on contact_history  (cost=0.00..39041.96 rows=1563996 width=15)
         ->  Materialize  (cost=113582.90..116023.66 rows=488152 width=15)
               ->  Sort  (cost=113582.90..114803.28 rows=488152 width=15)
                     Sort Key: ch.sid
                     ->  Seq Scan on contact_history ch  (cost=0.00..50771.93 rows=488152 width=15)
                           Filter: ((status = 5) AND ((type = 0) OR (type = 1)))
(13 rows)

EXPLAIN SELECT count(*) FROM contact_history as ch1 WHERE timestamp = (SELECT MAX(timestamp) FROM contact_history AS ch2 WHERE ch1.sid = ch2.sid) and (ch1.type='0' or ch1.type='1') and ch1.status = '5';
                                                                       QUERY PLAN

-----------------------------------------------------------------------------------------------------
---------------------------------------------------
 Aggregate  (cost=7919844.02..7919844.03 rows=1 width=0)
   ->  Seq Scan on contact_history ch1  (cost=0.00..7919837.92 rows=2441 width=0)
         Filter: ((status = 5) AND ((type = 0) OR (type = 1)) AND ("timestamp" = (SubPlan 2)))
         SubPlan 2
           ->  Result  (cost=5.02..5.03 rows=1 width=0)
                 InitPlan 1 (returns $1)
                   ->  Limit  (cost=0.43..5.02 rows=1 width=8)
                         ->  Index Only Scan Backward using contact_history_sid_timestamp_idx on contact_history ch2  (cost=0.43..32.57 rows=7 width=8)
                               Index Cond: ((sid = (ch1.sid)::text) AND ("timestamp" IS NOT NULL))
(9 rows)
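
Another formulation that might be worth comparing is PostgreSQL's DISTINCT ON, which is available in 9.3. The sketch below has not been run against this table, so no cost is claimed for it; the alias latest is just an illustrative name.

```sql
-- Sketch: pick the most recent row per sid first, then filter and count.
SELECT count(*)
FROM (
    SELECT DISTINCT ON (sid) sid, type, status
    FROM contact_history
    ORDER BY sid DESC, "timestamp" DESC
) AS latest
WHERE type IN (0, 1)
  AND status = 5;
```

The type and status filter has to stay outside the subquery: applied inside, it would select the latest matching row of each sid rather than test the sid's latest row. The ordering sid DESC, "timestamp" DESC is chosen because it corresponds to a backward scan of the existing (sid, "timestamp") index; whether the planner actually uses the index would need to be confirmed with EXPLAIN.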

Any improvements, additions, comments, or explanations are more than welcome. Thanks.