I'm building a Web API to run RESTful queries against resources that map well to Postgres tables. Most of the filter parameters also map well to parameters of the SQL query. A few filter parameters, however, require a call to my search index (a Sphinx server, in this case).
The simplest approach is to run the search, collect the primary keys from the search results, and stuff those into an IN (...) clause of the SQL query. However, since the search could return a lot of primary keys, I wonder whether this is such a bright idea.
I expect that most of the time (say, 90%) my searches will return results on the order of a few hundred. Perhaps 10% of the time, there will be on the order of thousands of results.
Is this a reasonable approach? Is there a better way?
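As a sketch, the pattern described above might look like this in Ruby (table and column names are placeholders, and the IDs are assumed to come from the search results): the keys are turned into numbered bind parameters rather than interpolated literals.

```ruby
# Hypothetical sketch: turn primary keys returned by the search index
# into a parameterized IN (...) clause. Table/column names are made up.
def in_clause_query(ids)
  placeholders = (1..ids.length).map { |i| "$#{i}" }.join(", ")
  ["SELECT * FROM resources WHERE id IN (#{placeholders})", ids]
end

sql, params = in_clause_query([42, 7, 19])
# sql    => "SELECT * FROM resources WHERE id IN ($1, $2, $3)"
# params => [42, 7, 19]
```

Using bind parameters keeps the statement safe against injection; whether the per-key overhead matters once the list reaches thousands of entries is exactly the question.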
Answer 0 (score: 14)
I strongly favor an experimental approach to answering performance questions. @Catcall made a nice start, but his experiment is much smaller than many real databases. His 300,000 single-integer rows fit easily in memory, so no IO is occurring; in addition, he didn't share the actual numbers.
I composed a similar experiment, but sized the sample data at about 7x the memory available on my host (a 7GB dataset on a 1GB, 1-fractional-CPU VM with an NFS-mounted filesystem). There are 30,000,000 rows, each composed of a single indexed bigint and a random-length string between 0 and 400 bytes.
create table t(id bigint primary key, stuff text);
insert into t(id,stuff) select i, repeat('X',(random()*400)::integer)
from generate_series(0,30000000) i;
analyze t;
Here are explain analyze runtimes for selecting IN sets of 10, 100, 1,000, 10,000 and 100,000 random integers in the key domain. Each query is of the following form, with $1 replaced by the set count.
explain analyze
select id from t
where id in (
select (random()*30000000)::integer from generate_series(0,$1)
);
Summary times:
IN-set size     Total runtime
10                   84.580 ms
100                1184.843 ms
1,000             12407.059 ms
10,000           109747.169 ms
100,000         1016842.772 ms
Note that the plan stays the same for each IN-set cardinality: build a hash aggregate of the random integers, then loop and do a single indexed lookup for each value. The fetch time is near-linear with the cardinality of the IN set, in the 8-12 ms/row range. A faster storage system could doubtless improve these times dramatically, but the experiment shows that Pg handles very large IN sets with aplomb, at least from an execution-speed perspective. Note that if the list is supplied via a bind parameter or by literal interpolation into the SQL statement, you will incur additional overhead from network transmission of the query to the server and from increased parse time, but I suspect both are negligible compared to the IO time of the fetch query.
# fetch 10
Nested Loop (cost=30.00..2341.27 rows=15002521 width=8) (actual time=0.110..84.494 rows=11 loops=1)
-> HashAggregate (cost=30.00..32.00 rows=200 width=4) (actual time=0.046..0.054 rows=11 loops=1)
-> Function Scan on generate_series (cost=0.00..17.50 rows=1000 width=0) (actual time=0.036..0.039 rows=11 loops=1)
-> Index Scan using t_pkey on t (cost=0.00..11.53 rows=1 width=8) (actual time=7.672..7.673 rows=1 loops=11)
Index Cond: (t.id = (((random() * 30000000::double precision))::integer))
Total runtime: 84.580 ms
# fetch 100
Nested Loop (cost=30.00..2341.27 rows=15002521 width=8) (actual time=12.405..1184.758 rows=101 loops=1)
-> HashAggregate (cost=30.00..32.00 rows=200 width=4) (actual time=0.095..0.210 rows=101 loops=1)
-> Function Scan on generate_series (cost=0.00..17.50 rows=1000 width=0) (actual time=0.046..0.067 rows=101 loops=1)
-> Index Scan using t_pkey on t (cost=0.00..11.53 rows=1 width=8) (actual time=11.723..11.725 rows=1 loops=101)
Index Cond: (t.id = (((random() * 30000000::double precision))::integer))
Total runtime: 1184.843 ms
# fetch 1,000
Nested Loop (cost=30.00..2341.27 rows=15002521 width=8) (actual time=14.403..12406.667 rows=1001 loops=1)
-> HashAggregate (cost=30.00..32.00 rows=200 width=4) (actual time=0.609..1.689 rows=1001 loops=1)
-> Function Scan on generate_series (cost=0.00..17.50 rows=1000 width=0) (actual time=0.128..0.332 rows=1001 loops=1)
-> Index Scan using t_pkey on t (cost=0.00..11.53 rows=1 width=8) (actual time=12.381..12.390 rows=1 loops=1001)
Index Cond: (t.id = (((random() * 30000000::double precision))::integer))
Total runtime: 12407.059 ms
# fetch 10,000
Nested Loop (cost=30.00..2341.27 rows=15002521 width=8) (actual time=21.884..109743.854 rows=9998 loops=1)
-> HashAggregate (cost=30.00..32.00 rows=200 width=4) (actual time=5.761..18.090 rows=9998 loops=1)
-> Function Scan on generate_series (cost=0.00..17.50 rows=1000 width=0) (actual time=1.004..3.087 rows=10001 loops=1)
-> Index Scan using t_pkey on t (cost=0.00..11.53 rows=1 width=8) (actual time=10.968..10.972 rows=1 loops=9998)
Index Cond: (t.id = (((random() * 30000000::double precision))::integer))
Total runtime: 109747.169 ms
# fetch 100,000
Nested Loop (cost=30.00..2341.27 rows=15002521 width=8) (actual time=110.244..1016781.944 rows=99816 loops=1)
-> HashAggregate (cost=30.00..32.00 rows=200 width=4) (actual time=110.169..253.947 rows=99816 loops=1)
-> Function Scan on generate_series (cost=0.00..17.50 rows=1000 width=0) (actual time=51.141..77.482 rows=100001 loops=1)
-> Index Scan using t_pkey on t (cost=0.00..11.53 rows=1 width=8) (actual time=10.176..10.181 rows=1 loops=99816)
Index Cond: (t.id = (((random() * 30000000::double precision))::integer))
Total runtime: 1016842.772 ms
At @Catcall's request, I ran similar queries using a CTE and a temp table. Both approaches had comparatively simple nested-loop index-scan plans and ran in comparable (though slightly slower) times than the inline IN queries.
-- CTE
EXPLAIN analyze
with ids as (select (random()*30000000)::integer as val from generate_series(0,1000))
select id from t where id in (select ids.val from ids);
Nested Loop (cost=40.00..2351.27 rows=15002521 width=8) (actual time=21.203..12878.329 rows=1001 loops=1)
CTE ids
-> Function Scan on generate_series (cost=0.00..17.50 rows=1000 width=0) (actual time=0.085..0.306 rows=1001 loops=1)
-> HashAggregate (cost=22.50..24.50 rows=200 width=4) (actual time=0.771..1.907 rows=1001 loops=1)
-> CTE Scan on ids (cost=0.00..20.00 rows=1000 width=4) (actual time=0.087..0.552 rows=1001 loops=1)
-> Index Scan using t_pkey on t (cost=0.00..11.53 rows=1 width=8) (actual time=12.859..12.861 rows=1 loops=1001)
Index Cond: (t.id = ids.val)
Total runtime: 12878.812 ms
(8 rows)
-- Temp table
create table temp_ids as select (random()*30000000)::bigint as val from generate_series(0,1000);
explain analyze select id from t where t.id in (select val from temp_ids);
Nested Loop (cost=17.51..11585.41 rows=1001 width=8) (actual time=7.062..15724.571 rows=1001 loops=1)
-> HashAggregate (cost=17.51..27.52 rows=1001 width=8) (actual time=0.268..1.356 rows=1001 loops=1)
-> Seq Scan on temp_ids (cost=0.00..15.01 rows=1001 width=8) (actual time=0.007..0.080 rows=1001 loops=1)
-> Index Scan using t_pkey on t (cost=0.00..11.53 rows=1 width=8) (actual time=15.703..15.705 rows=1 loops=1001)
Index Cond: (t.id = temp_ids.val)
Total runtime: 15725.063 ms
-- another way: joining against the temp table instead of IN
explain analyze select id from t join temp_ids on (t.id = temp_ids.val);
Nested Loop (cost=0.00..24687.88 rows=2140 width=8) (actual time=22.594..16557.789 rows=1001 loops=1)
-> Seq Scan on temp_ids (cost=0.00..31.40 rows=2140 width=8) (actual time=0.014..0.872 rows=1001 loops=1)
-> Index Scan using t_pkey on t (cost=0.00..11.51 rows=1 width=8) (actual time=16.536..16.537 rows=1 loops=1001)
Index Cond: (t.id = temp_ids.val)
Total runtime: 16558.331 ms
When run a second time, the temp table queries ran very quickly, but that's because the id value set is constant, so the target data was fresh in cache and Pg performed no real IO the second time around.
Answer 1 (score: 5)
My somewhat naive tests show that using IN (...) is at least an order of magnitude faster than both a join against a temp table and a join against a common table expression. (Frankly, that surprised me.) I tested 3000 integer values against a table of 300,000 rows.
create table integers (
n integer primary key
);
insert into integers
select generate_series(0, 300000);
-- External ruby program generates 3000 random integers in the range of 0 to 299999.
-- Used Emacs to massage the output into a SQL statement that looks like
explain analyze
select integers.n
from integers where n in (
100109,
100354 ,
100524 ,
...
);
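The original generator script wasn't shared; a minimal Ruby sketch that produces an equivalent statement (3000 random integers in the range 0 to 299999, formatted as an IN list) could look like:

```ruby
# Sketch of the generator described above: print an EXPLAIN ANALYZE
# statement whose IN list holds 3000 random integers in [0, 299999].
ids = Array.new(3000) { rand(300_000) }
puts "explain analyze"
puts "select integers.n"
puts "from integers where n in ("
puts ids.join(",\n")
puts ");"
```

Note that rand may repeat values, so the IN list can contain duplicates; Postgres deduplicates them in the HashAggregate step anyway.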
Answer 2 (score: 3)
In response to @Catcall's post: I couldn't resist double-testing it myself. Amazing! Counterintuitive. The execution plans are similar for SELECT ... IN ... and SELECT ... JOIN ... (both queries use the implicit index):
CREATE TABLE integers (
n integer PRIMARY KEY
);
INSERT INTO integers
SELECT generate_series(0, 300000);
CREATE TABLE search ( n integer );
-- Generate INSERTS and SELECT ... WHERE ... IN (...)
SELECT 'SELECT integers.n
FROM integers WHERE n IN (' || list || ');',
' INSERT INTO search VALUES '
|| values ||'; ' FROM (
SELECT string_agg( n::text, ',') AS list, string_agg( '('||n::text||')', ',') AS values FROM (
SELECT n FROM integers ORDER BY random() LIMIT 3000 ) AS elements ) AS raw
INSERT INTO search VALUES (9155),(189177),(18815),(13027),... ;
EXPLAIN SELECT integers.n
FROM integers WHERE n IN (9155,189177,18815,13027,...);
EXPLAIN SELECT integers.n FROM integers JOIN search ON integers.n = search.n;