Question

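My system has a lot of devices that can take measurements. These measurements are stored in the table "sample_data". Each device can have 10M measurements per year. Most of the time the user is only interested in 100 min-max pairs in equal intervals over some period of time, for example the last 24 hours or the last 53 weeks. To get these 100 minimums and maximums, the period is divided into 100 equal intervals, and the minimum and maximum values are extracted from each interval. What would you recommend as the most efficient way to query the data? So far I have tried the following query: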

WITH periods AS (
  SELECT time.start AS st, time.start + (interval '1 year' / 100) AS en
  FROM generate_series(now() - interval '1 year', now(), interval '1 year' / 100) AS time(start)
)
SELECT s.* FROM sample_data s
  JOIN periods ON s.time BETWEEN periods.st AND periods.en 
  JOIN devices d ON d.customer_id = 23
  WHERE
    s.id = (SELECT id FROM sample_data WHERE device_id = d.id AND time BETWEEN periods.st AND periods.en ORDER BY sample ASC LIMIT 1) OR
    s.id = (SELECT id FROM sample_data WHERE device_id = d.id AND time BETWEEN periods.st AND periods.en ORDER BY sample DESC LIMIT 1)

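This query takes about 4 seconds. That is not acceptable, because the sample_data table can contain up to 10M rows per device. I can see that it does not run in a very optimized way, but I don't understand why. I thought I had indexed all the key fields used in this query.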

Can you suggest a faster way to get this kind of statistics?

Table "devices":

       Column       |            Type             |                      Modifiers                       
--------------------+-----------------------------+------------------------------------------------------
 id                 | integer                     | not null default nextval('devices_id_seq'::regclass)
 customer_id        | integer                     | 

    <Other fields skipped as they are not involved in the query>
Indexes:
"devices_pkey" PRIMARY KEY, btree (id)
"index_devices_on_iccid" UNIQUE, btree (iccid)

It has 12 devices, and only 4 of them belong to customer_id = 23, the customer specified in the query.

Table "sample_data":

     Column     |            Type             |                        Modifiers                         
----------------+-----------------------------+----------------------------------------------------------
 id             | integer                     | not null default nextval('sample_data_id_seq'::regclass)
 sample         | numeric                     | not null
 time           | timestamp without time zone | not null
 device_id      | integer                     | not null
 customer_id    | integer                     | not null
Indexes:
"sample_data_pkey" PRIMARY KEY, btree (id)
"sample_data_device_id_time_sample_idx" btree (device_id, "time", sample)

It has about 1.7M rows. About 720K of those rows belong to the 4 devices of customer_id = 23. The table is currently populated with test data.

"select version()" result:

PostgreSQL 9.3.5 on x86_64-apple-darwin13.3.0, compiled by Apple LLVM version 5.0 (clang-500.2.79) (based on LLVM 3.3svn), 64-bit

track_io_timing is set to "on".

The EXPLAIN (ANALYZE, BUFFERS) output is here: http://explain.depesz.com/s/kA12

Answer 1

My guess is that the performance is driven by the subqueries in the WHERE clause. Let's look at one of them:

WHERE s.id = (SELECT sd.id
              FROM sample_data sd
              WHERE sd.device_id = d.id and
                    sd.time BETWEEN periods.st AND periods.en
              ORDER BY sd.sample ASC
              LIMIT 1
             )

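You have an index on sample_data(device_id, time, sample), and you want the database engine to use it. Unfortunately, the engine can make full use of the index only for the WHERE clause. Because of the BETWEEN on time, the matching index entries are ordered by time rather than by sample, so it probably won't use the index for the ORDER BY.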
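One way to check this guess is to run the subquery on its own under EXPLAIN. The following is a minimal sketch: the device id and the time range are placeholder values that do not come from the original post, so substitute real ones from your data.

```sql
-- Sketch only: device_id = 1 and the timestamps are placeholder values.
EXPLAIN (ANALYZE, BUFFERS)
SELECT sd.id
FROM sample_data sd
WHERE sd.device_id = 1
  AND sd.time BETWEEN timestamp '2014-01-01' AND timestamp '2014-01-04'
ORDER BY sd.sample ASC
LIMIT 1;
```

If the plan shows a Sort node above the scan of sample_data_device_id_time_sample_idx, the ordering by sample is being done after the rows are fetched rather than by the index.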

Is it possible to phrase the query using ORDER BY time instead?

WHERE s.id = (SELECT id
              FROM sample_data
              WHERE device_id = d.id and
                    time BETWEEN periods.st AND periods.en
              ORDER BY time ASC
              LIMIT 1
             )
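
As a different technique from the one asked about above, the per-interval minimum and maximum can also be computed with plain aggregates instead of correlated ORDER BY ... LIMIT 1 subqueries. This is a minimal sketch, assuming that only the aggregate values are needed rather than the full sample_data rows. The bounds CTE and the output column names are hypothetical, and whether it runs faster on this data has not been measured.

```sql
-- Sketch only: "bounds", "bucket", "min_sample" and "max_sample" are illustrative names.
WITH bounds AS (
  SELECT (now() - interval '1 year')::timestamp AS lo,
         now()::timestamp                       AS hi
)
SELECT s.device_id,
       -- Map each sample time to one of 100 equal-width intervals (1..100).
       width_bucket(extract(epoch FROM s.time),
                    extract(epoch FROM b.lo),
                    extract(epoch FROM b.hi),
                    100)     AS bucket,
       min(s.sample)         AS min_sample,
       max(s.sample)         AS max_sample
FROM bounds b
JOIN devices d     ON d.customer_id = 23
JOIN sample_data s ON s.device_id = d.id
                  AND s.time >= b.lo
                  AND s.time <  b.hi   -- half-open range keeps bucket within 1..100
GROUP BY 1, 2
ORDER BY 1, 2;
```

This returns one row per device and interval. Intervals that contain no samples for a device are simply absent from the result.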

Slow PostgreSQL query for min and max values in equal intervals over a period of time