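I frequently run the following two queries against a table that essentially collects logging information. Both select distinct values from a huge number of rows, but those rows contain fewer than 10 different values.
I have analyzed the two "distinct" queries issued by the page: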
marchena=> explain select distinct auditrecor0_.bundle_id as col_0_0_ from audit_records auditrecor0_;
QUERY PLAN
----------------------------------------------------------------------------------------------
HashAggregate  (cost=1070734.05..1070734.11 rows=6 width=21)
  ->  Seq Scan on audit_records auditrecor0_  (cost=0.00..1023050.24 rows=19073524 width=21)
(2 rows)
marchena=> explain select distinct auditrecor0_.server_name as col_0_0_ from audit_records auditrecor0_;
QUERY PLAN
----------------------------------------------------------------------------------------------
HashAggregate  (cost=1070735.34..1070735.39 rows=5 width=13)
  ->  Seq Scan on audit_records auditrecor0_  (cost=0.00..1023051.47 rows=19073547 width=13)
(2 rows)
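Both perform a sequential scan on the table. However, if I turn off enable_seqscan (despite the name, this does not forbid sequential scans outright; it only steers the planner away from them when an alternative such as an index scan exists), the queries use the index but are even slower: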
marchena=> set enable_seqscan = off;
SET
marchena=> explain select distinct auditrecor0_.bundle_id as col_0_0_ from audit_records auditrecor0_;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Unique  (cost=0.00..19613740.62 rows=6 width=21)
  ->  Index Scan using audit_bundle_idx on audit_records auditrecor0_  (cost=0.00..19566056.69 rows=19073570 width=21)
(2 rows)
marchena=> explain select distinct auditrecor0_.server_name as col_0_0_ from audit_records auditrecor0_;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Unique  (cost=0.00..45851449.96 rows=5 width=13)
  ->  Index Scan using audit_server_idx on audit_records auditrecor0_  (cost=0.00..45803766.04 rows=19073570 width=13)
(2 rows)
Both the bundle_id and server_name columns have btree indexes. Should I be using a different type of index to make selecting distinct values fast?
Answer 0 (score: 15)
BEGIN;
CREATE TABLE dist ( x INTEGER NOT NULL );
INSERT INTO dist SELECT random()*50 FROM generate_series( 1, 5000000 );
COMMIT;
CREATE INDEX dist_x ON dist(x);
VACUUM ANALYZE dist;
EXPLAIN ANALYZE SELECT DISTINCT x FROM dist;
HashAggregate  (cost=84624.00..84624.51 rows=51 width=4) (actual time=1840.141..1840.153 rows=51 loops=1)
  ->  Seq Scan on dist  (cost=0.00..72124.00 rows=5000000 width=4) (actual time=0.003..573.819 rows=5000000 loops=1)
Total runtime: 1848.060 ms
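PG cannot (yet) use an index for DISTINCT by skipping over identical values, but you can do this:
CREATE OR REPLACE FUNCTION distinct_skip_foo()
RETURNS SETOF INTEGER
LANGUAGE plpgsql STABLE
AS $$
DECLARE
    _x INTEGER;
BEGIN
    -- Start at the smallest value; min() on an indexed column is a single index probe.
    SELECT min(x) INTO _x FROM dist;
    WHILE _x IS NOT NULL LOOP
        RETURN NEXT _x;
        -- Jump to the next distinct value, skipping all duplicates of _x.
        -- min() yields NULL when no larger value exists, which ends the loop.
        SELECT min(x) INTO _x FROM dist WHERE x > _x;
    END LOOP;
END;
$$ ;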
EXPLAIN ANALYZE SELECT * FROM distinct_skip_foo();
Function Scan on distinct_skip_foo (cost=0.00..260.00 rows=1000 width=4) (actual time=1.629..1.635 rows=51 loops=1)
Total runtime: 1.652 ms
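On PostgreSQL 8.4 or later, the same skip-through-the-index idea can probably also be written without a function, as a recursive CTE. This is only a minimal sketch, assuming the dist table and dist_x index created above; the CTE name skip is made up for the illustration, and no timings are claimed for it:

```sql
WITH RECURSIVE skip AS (
    -- Anchor: the smallest value, found with one probe of dist_x.
    (SELECT x FROM dist ORDER BY x LIMIT 1)
    UNION ALL
    -- Step: the next value strictly greater than the previous one.
    -- The scalar subquery yields NULL once no larger value exists.
    SELECT (SELECT x FROM dist WHERE x > skip.x ORDER BY x LIMIT 1)
    FROM skip
    WHERE skip.x IS NOT NULL
)
-- Drop the trailing NULL row that terminates the recursion.
SELECT x FROM skip WHERE x IS NOT NULL;
```

Like the function, this issues one small index lookup per distinct value, so it only pays off when the number of distinct values is tiny compared to the row count.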
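Answer 1 (score: 7)
You are selecting distinct values from the whole table, which leads to a seq scan. You have millions of rows, so it is bound to be slow.
There is a trick to get the distinct values faster, but it only works when the data has a known (and reasonably small) set of possible values. For example, I assume your bundle_id references some smaller bundles table. That means you can write: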
select bundles.bundle_id
from bundles
where exists (
    select 1 from audit_records
    where audit_records.bundle_id = bundles.bundle_id
);
This should result in a nested loop: a seq scan on bundles -> an index scan on audit_records using the index on bundle_id.
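If there is no lookup table but the candidate values are known anyway, the same pattern might be applied with an inline VALUES list. The sketch below assumes the audit_server_idx index on server_name from the question; the server names and the alias candidates are purely hypothetical placeholders:

```sql
-- Probe the index once per candidate instead of scanning every row.
SELECT candidates.server_name
FROM (VALUES ('app01'), ('app02'), ('app03'))  -- hypothetical names
     AS candidates(server_name)
WHERE EXISTS (
    SELECT 1 FROM audit_records
    WHERE audit_records.server_name = candidates.server_name
);
```

Any value that is missing from the candidate list is silently left out of the result, so this only fits cases where the full set of possible values really is known in advance.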
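Answer 2 (score: 4)
I had the same problem with tables of more than 3 million records and an indexed field with only a few distinct values. I could not get rid of the seq scan, so I wrote this function to emulate a distinct search using the index, if one exists. The function performs poorly if your table has a number of distinct values proportional to the total number of records. It also has to be adapted for distinct values over multiple columns. Warning: this function is wide open to SQL injection and should only be used in a secured environment.
EXPLAIN ANALYZE results:
Query with a plain SELECT DISTINCT: Total runtime: 598310.705 ms
Query with SELECT small_distinct(...): Total runtime: 1.156 ms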
CREATE OR REPLACE FUNCTION small_distinct(
    tableName varchar, fieldName varchar, sample anyelement = ''::varchar)
    -- Search a few distinct values in a possibly huge table
    -- Parameters: tableName or query expression, fieldName,
    --             sample: any value to specify result type (default is varchar)
    -- Author: T.Husson, 2012-09-17, distribute/use freely
RETURNS TABLE ( result anyelement ) AS
$BODY$
BEGIN
    -- Fetch the smallest value first.
    EXECUTE 'SELECT '||fieldName||' FROM '||tableName||' ORDER BY '||fieldName
        ||' LIMIT 1' INTO result;
    WHILE result IS NOT NULL LOOP
        RETURN NEXT;
        -- Fetch the next value strictly greater than the one just returned.
        EXECUTE 'SELECT '||fieldName||' FROM '||tableName
            ||' WHERE '||fieldName||' > $1 ORDER BY ' || fieldName || ' LIMIT 1'
            INTO result USING result;
    END LOOP;
END;
$BODY$ LANGUAGE plpgsql VOLATILE;
Sample calls:
SELECT small_distinct('observations','id_source',1);
SELECT small_distinct('(select * from obs where id_obs > 12345) as temp',
'date_valid','2000-01-01'::timestamp);
SELECT small_distinct('addresses','state');
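When the first argument is always a plain table name rather than a query expression, the injection surface can probably be reduced by letting PostgreSQL quote the identifiers. The following is only a sketch under that assumption, for PostgreSQL 9.1 or later (where format() is available). The name small_distinct_quoted is made up for the illustration, and this variant gives up the ability to pass a subquery as the table:

```sql
CREATE OR REPLACE FUNCTION small_distinct_quoted(
    tbl regclass, col name, sample anyelement = ''::varchar)
RETURNS TABLE ( result anyelement ) AS
$BODY$
BEGIN
    -- regclass rejects names that are not existing relations and prints
    -- them safely quoted; %I quotes the column name as an identifier.
    EXECUTE format('SELECT %I FROM %s ORDER BY %I LIMIT 1', col, tbl, col)
        INTO result;
    WHILE result IS NOT NULL LOOP
        RETURN NEXT;
        -- Same skip step as small_distinct, with quoted identifiers.
        EXECUTE format('SELECT %I FROM %s WHERE %I > $1 ORDER BY %I LIMIT 1',
                       col, tbl, col, col)
            INTO result USING result;
    END LOOP;
END;
$BODY$ LANGUAGE plpgsql VOLATILE;
```

It would be called the same way as the first sample above, for example SELECT small_distinct_quoted('observations','id_source',1);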
Answer 3 (score: 1)
On PostgreSQL 9.3, starting from Denis's answer:
select bundles.bundle_id
from bundles
where exists (
    select 1 from audit_records
    where audit_records.bundle_id = bundles.bundle_id
);
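Just by adding a "limit 1" to the subquery, I got a 60x speedup (for my use case, with 8 million records, a composite index, and 10k combinations), going from 1800ms to 30ms: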
select bundles.bundle_id
from bundles
where exists (
    select 1 from audit_records
    where audit_records.bundle_id = bundles.bundle_id limit 1
);
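Whether the extra limit helps will likely depend on the PostgreSQL version and the data, so it is worth comparing both plans on your own table. A minimal sketch of such a comparison, assuming the same bundles and audit_records tables (the BUFFERS option needs PostgreSQL 9.0 or later):

```sql
-- Plan and timing without the inner limit.
EXPLAIN (ANALYZE, BUFFERS)
select bundles.bundle_id
from bundles
where exists (
    select 1 from audit_records
    where audit_records.bundle_id = bundles.bundle_id
);

-- Plan and timing with the inner limit.
EXPLAIN (ANALYZE, BUFFERS)
select bundles.bundle_id
from bundles
where exists (
    select 1 from audit_records
    where audit_records.bundle_id = bundles.bundle_id limit 1
);
```

Run each statement more than once so that a cold cache on the first run does not distort the comparison.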