max() vs ORDER BY DESC + LIMIT 1

Time: 2015-12-12 23:54:56

Tags: sql postgresql max aggregate sql-limit

I was troubleshooting a few slow SQL queries today and don't quite understand the performance difference below.

When extracting max(timestamp) from a data table based on some condition, MAX() is slower than ORDER BY timestamp DESC LIMIT 1 if a matching row exists, but considerably faster if no matching row is found:

SELECT timestamp
FROM data JOIN sensors ON ( sensors.id = data.sensor_id )
WHERE sensors.station_id = 4
ORDER BY timestamp DESC LIMIT 1;
(0 rows)
Time: 1314.544 ms

SELECT timestamp
FROM data JOIN sensors ON ( sensors.id = data.sensor_id )
WHERE sensors.station_id = 5
ORDER BY timestamp DESC LIMIT 1;
(1 row)
Time: 10.890 ms

SELECT MAX(timestamp)
FROM data JOIN sensors ON ( sensors.id = data.sensor_id )
WHERE sensors.station_id = 4;
(0 rows)
Time: 0.869 ms

SELECT MAX(timestamp)
FROM data JOIN sensors ON ( sensors.id = data.sensor_id )
WHERE sensors.station_id = 5;
(1 row)
Time: 84.087 ms

There are indexes on (timestamp) and (sensor_id, timestamp), and I noticed that Postgres uses very different query plans and indexes for the two cases:

QUERY PLAN (ORDER BY)
--------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..9.47 rows=1 width=8)
   ->  Nested Loop  (cost=0.43..396254.63 rows=43823 width=8)
         Join Filter: (data.sensor_id = sensors.id)
         ->  Index Scan using timestamp_ind on data  (cost=0.43..254918.66 rows=4710976 width=12)
         ->  Materialize  (cost=0.00..6.70 rows=2 width=4)
               ->  Seq Scan on sensors  (cost=0.00..6.69 rows=2 width=4)
                     Filter: (station_id = 4)
(7 rows)

QUERY PLAN (MAX)
----------------------------------------------------------------------------------------------------------
 Aggregate  (cost=3680.59..3680.60 rows=1 width=8)
   ->  Nested Loop  (cost=0.43..3571.03 rows=43823 width=8)
         ->  Seq Scan on sensors  (cost=0.00..6.69 rows=2 width=4)
               Filter: (station_id = 4)
         ->  Index Only Scan using sensor_ind_timestamp on data  (cost=0.43..1389.59 rows=39258 width=12)
               Index Cond: (sensor_id = sensors.id)
(6 rows)

So my two questions are:

  1. Where does this performance difference come from? I have seen the accepted answer at "MIN/MAX vs ORDER BY and LIMIT", but that does not quite seem to apply here. Any good resources would be appreciated.
  2. Are there better ways to improve performance in all cases (matching row vs. no matching row) than adding an EXISTS check?

EDIT to address the questions in the comments below. I kept the initial query plans above for future reference.

Table definitions:

Table "public.sensors"
 Column     | Type    | Modifiers
------------+---------+------------------------------------------------------
 id         | integer | not null default nextval('sensors_id_seq'::regclass)
 station_id | integer | not null
 ....
Indexes:
    "sensor_primary" PRIMARY KEY, btree (id)
    "ind_station_id" btree (station_id, id)
    "ind_station" btree (station_id)

Table "public.data"
 Column    | Type                     | Modifiers
-----------+--------------------------+---------------------------------------------------
 id        | integer                  | not null default nextval('data_id_seq'::regclass)
 timestamp | timestamp with time zone | not null
 sensor_id | integer                  | not null
 avg       | integer                  |
Indexes:
    "timestamp_ind" btree ("timestamp" DESC)
    "sensor_ind" btree (sensor_id)
    "sensor_ind_timestamp" btree (sensor_id, "timestamp")
    "sensor_ind_timestamp_desc" btree (sensor_id, "timestamp" DESC)

Note that I added ind_station_id on sensors just now, following @Erwin's suggestion. Timings have not changed drastically: still >1200ms in the ORDER BY DESC + LIMIT 1 case and ~0.9ms in the MAX case.

Query plans:

QUERY PLAN (ORDER BY)
----------------------------------------------------------------------------------------------------------
 Limit  (cost=0.58..9.62 rows=1 width=8) (actual time=2161.054..2161.054 rows=0 loops=1)
   Buffers: shared hit=3418066 read=47326
   ->  Nested Loop  (cost=0.58..396382.45 rows=43823 width=8) (actual time=2161.053..2161.053 rows=0 loops=1)
         Join Filter: (data.sensor_id = sensors.id)
         Buffers: shared hit=3418066 read=47326
         ->  Index Scan using timestamp_ind on data  (cost=0.43..255048.99 rows=4710976 width=12) (actual time=0.047..1410.715 rows=4710976 loops=1)
               Buffers: shared hit=3418065 read=47326
         ->  Materialize  (cost=0.14..4.19 rows=2 width=4) (actual time=0.000..0.000 rows=0 loops=4710976)
               Buffers: shared hit=1
               ->  Index Only Scan using ind_station_id on sensors  (cost=0.14..4.18 rows=2 width=4) (actual time=0.004..0.004 rows=0 loops=1)
                     Index Cond: (station_id = 4)
                     Heap Fetches: 0
                     Buffers: shared hit=1
 Planning time: 0.478 ms
 Execution time: 2161.090 ms
(15 rows)

QUERY (MAX)
----------------------------------------------------------------------------------------------------------
 Aggregate  (cost=3678.08..3678.09 rows=1 width=8) (actual time=0.009..0.009 rows=1 loops=1)
   Buffers: shared hit=1
   ->  Nested Loop  (cost=0.58..3568.52 rows=43823 width=8) (actual time=0.006..0.006 rows=0 loops=1)
         Buffers: shared hit=1
         ->  Index Only Scan using ind_station_id on sensors  (cost=0.14..4.18 rows=2 width=4) (actual time=0.005..0.005 rows=0 loops=1)
               Index Cond: (station_id = 4)
               Heap Fetches: 0
               Buffers: shared hit=1
         ->  Index Only Scan using sensor_ind_timestamp on data  (cost=0.43..1389.59 rows=39258 width=12) (never executed)
               Index Cond: (sensor_id = sensors.id)
               Heap Fetches: 0
 Planning time: 0.435 ms
 Execution time: 0.048 ms
(13 rows)

So, just as in the earlier plans above, ORDER BY does an Index Scan using timestamp_ind on data, which is not done in the MAX case.

Postgres version: Postgres from the Ubuntu repos: PostgreSQL 9.4.5 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu 5.2.1-21ubuntu2) 5.2.1 20151003, 64-bit

Note that NOT NULL constraints are in place, so ORDER BY won't have to sort over empty (NULL) rows.

Note also that I am mostly interested in where the difference comes from. While not ideal, I can retrieve the data relatively quickly using EXISTS (<1ms) and then SELECT (~11ms).
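The two-step workaround mentioned in the question (an EXISTS check followed by the SELECT) is not shown in the post. A minimal sketch of what it might look like, assuming the same tables and station id 4 as in the queries above, and assuming the application only runs the second statement when the first returns true:

```sql
-- Step 1 (assumed form): cheap existence test for the station.
SELECT EXISTS (
   SELECT 1
   FROM   data
   JOIN   sensors ON sensors.id = data.sensor_id
   WHERE  sensors.station_id = 4
   ) AS has_rows;

-- Step 2: run only if step 1 returned true, so the slow
-- "no matching row" case of ORDER BY ... LIMIT 1 is never hit.
SELECT data."timestamp"
FROM   data
JOIN   sensors ON sensors.id = data.sensor_id
WHERE  sensors.station_id = 4
ORDER  BY data."timestamp" DESC
LIMIT  1;
```

This costs an extra round trip and leaves the underlying plan choice unchanged, which is why the question asks for something better.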

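2 Answers:

Answer 0 (score: 10)

There does not seem to be an index on sensor.station_id, which is important here.

There is an actual difference between max() and ORDER BY DESC + LIMIT 1 that many people seem to miss. NULL values sort first in descending sort order. So ORDER BY timestamp DESC LIMIT 1 returns a row with timestamp IS NULL if one exists, while the aggregate function max() ignores NULL values and returns the latest non-null timestamp.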
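The NULL-ordering difference can be checked on a scratch table. This is a small sketch for PostgreSQL; the table name null_sort_demo and its three rows are made up for illustration and are not part of the question's schema:

```sql
-- Hypothetical scratch table with one NULL timestamp.
CREATE TEMP TABLE null_sort_demo (ts timestamptz);

INSERT INTO null_sort_demo (ts) VALUES
   ('2015-12-01 00:00:00+00'),
   (NULL),
   ('2015-12-12 00:00:00+00');

-- DESC defaults to NULLS FIRST, so this returns the NULL row.
SELECT ts FROM null_sort_demo ORDER BY ts DESC LIMIT 1;

-- NULLS LAST pushes the NULL to the end; this returns the 2015-12-12 row.
SELECT ts FROM null_sort_demo ORDER BY ts DESC NULLS LAST LIMIT 1;

-- max() ignores NULL values and also returns the 2015-12-12 timestamp.
SELECT max(ts) FROM null_sort_demo;
```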

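For your case, since your column d.timestamp is defined NOT NULL (as your update revealed), there is no effective difference. An index with DESC NULLS LAST and the same clause in the ORDER BY of the LIMIT query should still serve you best. I suggest these indexes (my query below builds on the second one):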

sensor(station_id, id)
data(sensor_id, timestamp DESC NULLS LAST)

You can drop the other index variants sensor_ind_timestamp and sensor_ind_timestamp_desc unless you have other queries that still need them (unlikely, but possible).
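Spelled out as DDL, the suggested indexes could look like the sketch below. The index names are hypothetical, and the table is written as sensors to match the table definition in the question. That definition already lists ind_station_id on (station_id, id), so the first statement may be redundant there:

```sql
-- Index names are made up for this sketch.
CREATE INDEX sensors_station_id_id_idx
    ON sensors (station_id, id);

CREATE INDEX data_sensor_id_timestamp_nl_idx
    ON data (sensor_id, "timestamp" DESC NULLS LAST);

-- Optional cleanup, only after confirming no other query relies on them:
-- DROP INDEX sensor_ind_timestamp;
-- DROP INDEX sensor_ind_timestamp_desc;
```

The ORDER BY clause of a query has to match the index order (or its exact reverse) for the planner to read the first row straight from the index, which is why the queries in this answer repeat DESC NULLS LAST.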

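More importantly, there is another difficulty: the filter on the first table sensors returns few, but still (possibly) multiple rows. Postgres expects to find 2 rows (rows=2) in the EXPLAIN output you added.
The perfect technique would be a loose index scan on the second table data, which is not currently implemented in Postgres 9.4 (or Postgres 9.5). You can rewrite the query in various ways to work around this limitation.

The best should be: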

SELECT d.timestamp
FROM   sensors s
CROSS  JOIN LATERAL  (
   SELECT timestamp
   FROM   data
   WHERE  sensor_id = s.id
   ORDER  BY timestamp DESC NULLS LAST
   LIMIT  1
   ) d
WHERE  s.station_id = 4
ORDER  BY d.timestamp DESC NULLS LAST
LIMIT  1;

Since the style of the outer query is mostly irrelevant, you can also just use:

SELECT max(d.timestamp) AS timestamp
FROM   sensors s
CROSS  JOIN LATERAL  (
   SELECT timestamp
   FROM   data
   WHERE  sensor_id = s.id
   ORDER  BY timestamp DESC NULLS LAST
   LIMIT  1
   ) d
WHERE  s.station_id = 4;

The max() variant should perform about as fast now:

SELECT max(d.timestamp) AS timestamp
FROM   sensors s
CROSS  JOIN LATERAL  (
   SELECT max(timestamp) AS timestamp
   FROM   data
   WHERE  sensor_id = s.id
   ) d
WHERE  s.station_id = 4;

Or even, shortest of all:

SELECT max((SELECT max(timestamp) FROM data WHERE sensor_id = s.id)) AS timestamp
FROM   sensors s
WHERE  station_id = 4;

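Note the double parentheses!

An additional advantage of LIMIT in a LATERAL join is that you can retrieve arbitrary columns of the selected row, not just the latest timestamp (one column).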
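As an illustration of that last point, here is a variant of the first LATERAL query that also returns sensor_id and avg for the latest row. The column names come from the table definition in the question; the choice of which columns to return is an assumption made for this sketch:

```sql
SELECT d.sensor_id, d."timestamp", d.avg
FROM   sensors s
CROSS  JOIN LATERAL (
   -- Latest row per sensor, read from the (sensor_id, timestamp) index.
   SELECT sensor_id, "timestamp", avg
   FROM   data
   WHERE  sensor_id = s.id
   ORDER  BY "timestamp" DESC NULLS LAST
   LIMIT  1
   ) d
WHERE  s.station_id = 4
-- Pick the newest among the per-sensor candidates.
ORDER  BY d."timestamp" DESC NULLS LAST
LIMIT  1;
```

Because avg is not part of the suggested index, this variant has to visit the heap for each candidate row, so an index-only scan is no longer possible. With only a few sensors per station that is a handful of extra page reads.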


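Answer 1 (score: 2)

The query plan shows the index names timestamp_ind and timestamp_sensor_ind. But indexes like that do not help when searching for a particular sensor.

To resolve an equality condition (like sensor.id = data.sensor_id), the column has to be the first one in the index. Try adding an index that allows searching on sensor_id and is sorted by timestamp within each sensor: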

create index sensor_timestamp_ind on data(sensor_id, timestamp);

Does adding that index speed up the query?
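One way to answer that is to look at the plan for a single sensor. The table definition in the question already lists sensor_ind_timestamp on (sensor_id, "timestamp"), so an equivalent index may be in place. A sketch, where the sensor id 42 is a made-up value:

```sql
-- 42 is a hypothetical sensor id; substitute one that exists.
EXPLAIN (ANALYZE, BUFFERS)
SELECT max("timestamp")
FROM   data
WHERE  sensor_id = 42;
```

If the index is usable, the plan should show an index scan on data with a condition on sensor_id, touching only a few buffers, rather than a scan over timestamp_ind.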