Question

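Roughly every 10 minutes I insert about 50 records that all share the same timestamp.
That means about 300 records per hour, 7,200 records per day, or roughly 2.6 million records per year.
Users want to retrieve all records whose timestamp is closest to a requested time.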
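As a quick check of the volume arithmetic, the figures can be recomputed directly in SQL. This is only a sketch; it assumes exactly one batch of 50 rows every 10 minutes, all year round, and the column aliases are illustrative.

```sql
-- Assumption: 50 rows per batch, 6 batches per hour, 365 days per year.
SELECT 50 * 6            AS rows_per_hour,   -- 300
       50 * 6 * 24       AS rows_per_day,    -- 7200
       50 * 6 * 24 * 365 AS rows_per_year;   -- 2628000
```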

Design #1 - a single table with an index on the timestamp column:

    CREATE TABLE A (t timestamp, value int);
    CREATE INDEX a_idx ON A (t);

A single INSERT statement creates about 50 records with the same timestamp:

    INSERT INTO A VALUES
      ('2019-01-02 10:00', 5),
      ('2019-01-02 10:00', 12),
      ('2019-01-02 10:00', 7),
      ...;

Get all records closest to the requested time
(I use the greatest() function available in PostgreSQL):

    SELECT * FROM A WHERE t =
      (SELECT t FROM A ORDER BY greatest(t - asked_time, asked_time - t) LIMIT 1)

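I think this query is inefficient because it requires a full table scan.
I plan to partition table A by timestamp, with one partition per year, but the approximate match above would still be slow.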
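For reference, yearly partitioning of Design #1 could look roughly like the sketch below. It assumes PostgreSQL 11 or later with declarative partitioning; the partition names `a_2019` and `a_2020` and the index name `a_t_idx` are illustrative choices, not part of the original design.

```sql
-- Sketch only: range-partition the single table by year.
CREATE TABLE a (t timestamp, value int) PARTITION BY RANGE (t);

-- One partition per calendar year (upper bound is exclusive).
CREATE TABLE a_2019 PARTITION OF a
    FOR VALUES FROM ('2019-01-01') TO ('2020-01-01');
CREATE TABLE a_2020 PARTITION OF a
    FOR VALUES FROM ('2020-01-01') TO ('2021-01-01');

-- On PostgreSQL 11 or later, an index created on the parent
-- is created on every partition as well.
CREATE INDEX a_t_idx ON a (t);
```

Partitioning mainly helps with data retention and with queries that are restricted to one year; by itself it does not make the ORDER BY greatest(...) lookup use an index.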

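Design #2 - create two tables:
First table: holds the unique timestamps and an auto-incrementing PK,
Second table: holds the data and a foreign key to the first table's PK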

    CREATE TABLE UNIQ_TIMESTAMP (id SERIAL PRIMARY KEY, t timestamp);
    CREATE TABLE DATA (id INTEGER REFERENCES UNIQ_TIMESTAMP (id), value int);
    CREATE INDEX data_time_idx ON DATA (id);

Get all records closest to the requested time:

    SELECT * FROM DATA WHERE id =
      (SELECT id FROM UNIQ_TIMESTAMP ORDER BY greatest(t - asked_time, asked_time - t) LIMIT 1)

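It should run faster than Design #1 because the nested select scans a much smaller table.
Drawbacks of this approach:
- I have to insert into two tables instead of one
- I lose the ability to partition the DATA table by timestamp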
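The two-table insert can still be done in a single statement with a data-modifying CTE. The following is a minimal sketch against the Design #2 tables; the CTE name `ts`, the alias `v`, and the three sample values are illustrative.

```sql
-- Insert the timestamp once, then attach every value to the generated id.
WITH ts AS (
    INSERT INTO UNIQ_TIMESTAMP (t)
    VALUES ('2019-01-02 10:00')
    RETURNING id
)
INSERT INTO DATA (id, value)
SELECT ts.id, v.value
FROM ts
CROSS JOIN (VALUES (5), (12), (7)) AS v(value);
```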

What would you recommend?

Answer 1

I would use the single-table approach, perhaps partitioned by year so that getting rid of old data is easy.

Create an index like

CREATE INDEX ON a (date_trunc('hour', t + INTERVAL '30 minutes'));

Then use your query as you wrote it, but add

AND date_trunc('hour', t + INTERVAL '30 minutes')
  = date_trunc('hour', asked_time + INTERVAL '30 minutes')

The additional condition acts as a filter and can use the index.
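Put together, the query might look like the sketch below. Placing the extra condition inside the subquery is an assumption on my part (the subquery is where the full scan would otherwise happen), and the timestamp literal is just an example value standing in for asked_time.

```sql
SELECT *
FROM a
WHERE t = (
    SELECT t
    FROM a
    -- Same expression as in the index, so the index can be used
    -- to narrow the candidates to the matching rounded hour.
    WHERE date_trunc('hour', t + INTERVAL '30 minutes')
        = date_trunc('hour', timestamp '2019-03-01 17:00:00' + INTERVAL '30 minutes')
    ORDER BY greatest(t - timestamp '2019-03-01 17:00:00',
                      timestamp '2019-03-01 17:00:00' - t)
    LIMIT 1
);
```

Only rows whose timestamp rounds to the same hour as the requested time are considered, so this relies on data arriving at least once per hour.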

Answer 2

You can find the timestamps nearest to a given timestamp, one at or after it and one at or before it, with a UNION of two queries:

(
  select t
  from a
  where t >= timestamp '2019-03-01 17:00:00'
  order by t
  limit 1
)
union all
(
  select t
  from a
  where t <= timestamp '2019-03-01 17:00:00'
  order by t desc
  limit 1
)

This makes efficient use of the index on t. On a table with 10 million rows (about 3 years of data), I get the following execution plan:

Append  (cost=0.57..1.16 rows=2 width=8) (actual time=0.381..0.407 rows=2 loops=1)
  Buffers: shared hit=6 read=4
  I/O Timings: read=0.050
  ->  Limit  (cost=0.57..0.58 rows=1 width=8) (actual time=0.380..0.381 rows=1 loops=1)
        Output: a.t
        Buffers: shared hit=1 read=4
        I/O Timings: read=0.050
        ->  Index Only Scan using a_t_idx on stuff.a  (cost=0.57..253023.35 rows=30699415 width=8) (actual time=0.380..0.380 rows=1 loops=1)
              Output: a.t
              Index Cond: (a.t >= '2019-03-01 17:00:00'::timestamp without time zone)
              Heap Fetches: 0
              Buffers: shared hit=1 read=4
              I/O Timings: read=0.050
  ->  Limit  (cost=0.57..0.58 rows=1 width=8) (actual time=0.024..0.025 rows=1 loops=1)
        Output: a_1.t
        Buffers: shared hit=5
        ->  Index Only Scan Backward using a_t_idx on stuff.a a_1  (cost=0.57..649469.88 rows=78800603 width=8) (actual time=0.024..0.024 rows=1 loops=1)
              Output: a_1.t
              Index Cond: (a_1.t <= '2019-03-01 17:00:00'::timestamp without time zone)
              Heap Fetches: 0
              Buffers: shared hit=5
Planning Time: 1.823 ms
Execution Time: 0.425 ms

As you can see, it requires only very few I/O operations, and that is pretty much independent of the table size.
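A plan with this level of detail can be produced along the following lines. This is a sketch, not necessarily the exact command used above: the VERBOSE option adds the Output lines, BUFFERS adds the Buffers lines, and the I/O Timings lines appear only when track_io_timing is enabled.

```sql
-- Assumption: the session is allowed to change track_io_timing
-- (this normally requires superuser rights).
SET track_io_timing = on;

EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
(select t from a
 where t >= timestamp '2019-03-01 17:00:00'
 order by t
 limit 1)
union all
(select t from a
 where t <= timestamp '2019-03-01 17:00:00'
 order by t desc
 limit 1);
```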

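The above can be used in an IN condition to fetch the matching rows (this returns the rows for both neighboring timestamps, not only the closer one):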

select *
from a
where t in ( 
  (select t
   from a
   where t >= timestamp '2019-03-01 17:00:00'
   order by t
   limit 1)
  union all
  (select t
   from a
   where t <= timestamp '2019-03-01 17:00:00'
   order by t desc
   limit 1)
);

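If you know that there will never be more than 100 values close to the requested timestamp, you could remove the IN query completely and simply use limit 100 in both parts of the union. That makes the query a bit more efficient because there is no second step to evaluate the IN condition, but it might return more rows than you want.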
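A sketch of that variant is shown below. One deliberate change from the earlier queries: the second branch uses `<` instead of `<=`, so that rows exactly at the requested timestamp are not returned twice. The limit of 100 is the assumed upper bound from the paragraph above.

```sql
(select *
 from a
 where t >= timestamp '2019-03-01 17:00:00'
 order by t
 limit 100)
union all
(select *
 from a
 where t < timestamp '2019-03-01 17:00:00'   -- strict, to avoid duplicates
 order by t desc
 limit 100);
```

The caller then has to pick the rows belonging to the nearest timestamp, since each branch may span several distinct timestamps.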

If you always look for timestamps within the same year, then partitioning by year will indeed help.

If the query is too complicated, you can put it into a function:

create or replace function get_closest(p_tocheck timestamp)
  returns timestamp
as
$$
  select *
  from (
     (select t
     from a
     where t >= p_tocheck
     order by t
     limit 1)
    union all
    (select t
     from a
     where t <= p_tocheck
     order by t desc
     limit 1)
  ) x
  order by greatest(t - p_tocheck, p_tocheck - t)
  limit 1;
$$
language sql stable;

The query then becomes as simple as:

select *
from a
where t = get_closest(timestamp '2019-03-01 17:00:00');

Another solution is to use the btree_gist extension, which provides a "distance" operator <->
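The extension has to be installed in the database before the operator can be used on timestamps. A minimal sketch, assuming the contrib modules are available on the server and the current role may create extensions:

```sql
-- btree_gist ships with PostgreSQL's contrib modules.
CREATE EXTENSION IF NOT EXISTS btree_gist;
```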

Then you can create a GiST index on the timestamp:

create index on a using gist (t) ;

and use the following query:

select *
from a where t in (select t
                  from a
                  order by t <-> timestamp '2019-03-01 17:00:00'
                  limit 1);

Database design for time series
