Is it reasonable to stuff 1000 IDs into a SELECT ... WHERE ... IN (...) query on Postgres?

Asked: 2012-03-28 16:26:48

Tags: sql postgresql search


Possible duplicate:
  PostgreSQL - max number of parameters in "IN" clause?

I'm developing a web API to perform RESTful queries against a resource that maps nicely to a Postgres table. Most of the filtering parameters also map nicely to parameters of the SQL query. A few filtering parameters, however, require a call to my search index (in this case, a Sphinx server).

The simplest thing to do would be to run the search, collect the primary keys from the search results, and stuff those into an IN (...) clause of the SQL query. However, since the search can return a lot of primary keys, I wonder whether that is such a bright idea.
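As a minimal sketch of that ID-stuffing step (assuming a hypothetical `resources` table and a driver that uses `%s`-style placeholders, such as psycopg2), the query can be built with one placeholder per ID so the driver handles quoting rather than string concatenation:

```python
def build_in_query(ids):
    """Build a parameterized IN (...) query for a list of primary keys.

    One %s placeholder per ID; the table and column names here are
    hypothetical stand-ins, not taken from the question.
    """
    if not ids:
        raise ValueError("empty ID list")
    placeholders = ", ".join(["%s"] * len(ids))
    sql = f"SELECT * FROM resources WHERE id IN ({placeholders})"
    return sql, list(ids)

sql, params = build_in_query([17, 42, 99])
# sql    -> "SELECT * FROM resources WHERE id IN (%s, %s, %s)"
# params -> [17, 42, 99]
```

The driver then sends `params` separately from the statement text, which avoids SQL-injection risk from search results.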

I anticipate that most of the time (say, 90%) my search will return results on the order of a few hundred. Perhaps 10% of the time there will be on the order of several thousand results.

Is this a reasonable approach? Is there a better way?

3 answers:

Answer 0 (score: 14):

I strongly favor an experimental approach to answering performance questions. @Catcall made a nice start, but his experiment is much smaller than many real databases. His 300,000 single-integer rows fit easily in memory, so no IO occurs; in addition, he did not share actual timings.

I composed a similar experiment, but sized the sample data at about 7x the memory available on my host (a 7GB dataset on a 1GB, 1-fractional-CPU VM with an NFS-mounted filesystem). There are 30,000,000 rows consisting of a single indexed bigint and a random-length string between 0 and 400 bytes.

create table t(id bigint primary key, stuff text);
insert into t(id,stuff) select i, repeat('X',(random()*400)::integer)
from generate_series(0,30000000) i;
analyze t;

Here are the explain-analyze runtimes of a SELECT ... IN over 10, 100, 1,000, 10,000 and 100,000 random integers in the key domain. Each query took the following form, with $1 replaced by the set count.

explain analyze
select id from t
where id in (
  select (random()*30000000)::integer from generate_series(0,$1)
);

Summary timings:

  ct        tot ms    ms/row
  10            84       8.4
  100         1185      11.8
  1,000      12407      12.4
  10,000    109747      11.0
  100,000  1016842      10.1

Note the plan stays the same for each IN-set cardinality: build a hash aggregate of the random integers, then loop and perform a single index lookup for each value. The fetch time is near-linear with the cardinality of the IN set, in the 8-12 ms/row range. A faster storage system could doubtless improve these times dramatically, but the experiment shows that Pg handles very large sets with aplomb, at least from an execution-speed perspective. Note that if the list is supplied via bind parameters or literal interpolation into the SQL statement, it will incur additional overhead for transmitting the query to the server and extra parse time, but I suspect both are negligible compared to the IO time of fetching the query.
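On the bind-parameter overhead just mentioned: one way to keep both the statement text and the parameter count constant regardless of set size is to pass the whole ID set as a single Postgres array. A sketch, assuming psycopg2 (which adapts a Python list to a Postgres array) and the experiment's table `t`:

```python
# Sketch: one array bind parameter instead of N scalar parameters.
# The statement text (and hence its parse cost) is the same whether
# the set holds ten IDs or a hundred thousand.
ids = [101, 205, 999]
sql = "SELECT id FROM t WHERE id = ANY(%s)"
params = (ids,)  # a single parameter: the whole list
# cursor.execute(sql, params)  # requires a live psycopg2 connection
```

`= ANY(array)` is semantically equivalent to `IN (...)` for this use, so the planner can still choose the same hash-aggregate/index-scan plan.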

# fetch 10
 Nested Loop  (cost=30.00..2341.27 rows=15002521 width=8) (actual time=0.110..84.494 rows=11 loops=1)
   ->  HashAggregate  (cost=30.00..32.00 rows=200 width=4) (actual time=0.046..0.054 rows=11 loops=1)
         ->  Function Scan on generate_series  (cost=0.00..17.50 rows=1000 width=0) (actual time=0.036..0.039 rows=11 loops=1)
   ->  Index Scan using t_pkey on t  (cost=0.00..11.53 rows=1 width=8) (actual time=7.672..7.673 rows=1 loops=11)
         Index Cond: (t.id = (((random() * 30000000::double precision))::integer))
 Total runtime: 84.580 ms


# fetch 100
 Nested Loop  (cost=30.00..2341.27 rows=15002521 width=8) (actual time=12.405..1184.758 rows=101 loops=1)
   ->  HashAggregate  (cost=30.00..32.00 rows=200 width=4) (actual time=0.095..0.210 rows=101 loops=1)
         ->  Function Scan on generate_series  (cost=0.00..17.50 rows=1000 width=0) (actual time=0.046..0.067 rows=101 loops=1)
   ->  Index Scan using t_pkey on t  (cost=0.00..11.53 rows=1 width=8) (actual time=11.723..11.725 rows=1 loops=101)
         Index Cond: (t.id = (((random() * 30000000::double precision))::integer))
 Total runtime: 1184.843 ms

# fetch 1,000
 Nested Loop  (cost=30.00..2341.27 rows=15002521 width=8) (actual time=14.403..12406.667 rows=1001 loops=1)
   ->  HashAggregate  (cost=30.00..32.00 rows=200 width=4) (actual time=0.609..1.689 rows=1001 loops=1)
         ->  Function Scan on generate_series  (cost=0.00..17.50 rows=1000 width=0) (actual time=0.128..0.332 rows=1001 loops=1)
   ->  Index Scan using t_pkey on t  (cost=0.00..11.53 rows=1 width=8) (actual time=12.381..12.390 rows=1 loops=1001)
         Index Cond: (t.id = (((random() * 30000000::double precision))::integer))
 Total runtime: 12407.059 ms

# fetch 10,000
 Nested Loop  (cost=30.00..2341.27 rows=15002521 width=8) (actual time=21.884..109743.854 rows=9998 loops=1)
   ->  HashAggregate  (cost=30.00..32.00 rows=200 width=4) (actual time=5.761..18.090 rows=9998 loops=1)
         ->  Function Scan on generate_series  (cost=0.00..17.50 rows=1000 width=0) (actual time=1.004..3.087 rows=10001 loops=1)
   ->  Index Scan using t_pkey on t  (cost=0.00..11.53 rows=1 width=8) (actual time=10.968..10.972 rows=1 loops=9998)
         Index Cond: (t.id = (((random() * 30000000::double precision))::integer))
 Total runtime: 109747.169 ms

# fetch 100,000
 Nested Loop  (cost=30.00..2341.27 rows=15002521 width=8) (actual time=110.244..1016781.944 rows=99816 loops=1)
   ->  HashAggregate  (cost=30.00..32.00 rows=200 width=4) (actual time=110.169..253.947 rows=99816 loops=1)
         ->  Function Scan on generate_series  (cost=0.00..17.50 rows=1000 width=0) (actual time=51.141..77.482 rows=100001 loops=1)
   ->  Index Scan using t_pkey on t  (cost=0.00..11.53 rows=1 width=8) (actual time=10.176..10.181 rows=1 loops=99816)
         Index Cond: (t.id = (((random() * 30000000::double precision))::integer))
 Total runtime: 1016842.772 ms

At @Catcall's request, I ran similar queries using a CTE and a temp table. Both approaches had comparatively simple nested-loop index-scan plans and ran in comparable (though slightly slower) times than the inline IN queries.

-- CTE
EXPLAIN analyze
with ids as (select (random()*30000000)::integer as val from generate_series(0,1000))
select id from t where id in (select ids.val from ids);

 Nested Loop  (cost=40.00..2351.27 rows=15002521 width=8) (actual time=21.203..12878.329 rows=1001 loops=1)
   CTE ids
     ->  Function Scan on generate_series  (cost=0.00..17.50 rows=1000 width=0) (actual time=0.085..0.306 rows=1001 loops=1)
   ->  HashAggregate  (cost=22.50..24.50 rows=200 width=4) (actual time=0.771..1.907 rows=1001 loops=1)
         ->  CTE Scan on ids  (cost=0.00..20.00 rows=1000 width=4) (actual time=0.087..0.552 rows=1001 loops=1)
   ->  Index Scan using t_pkey on t  (cost=0.00..11.53 rows=1 width=8) (actual time=12.859..12.861 rows=1 loops=1001)
         Index Cond: (t.id = ids.val)
 Total runtime: 12878.812 ms
(8 rows)

-- Temp table
create table temp_ids as select (random()*30000000)::bigint as val from generate_series(0,1000);

explain analyze select id from t where t.id in (select val from temp_ids);

 Nested Loop  (cost=17.51..11585.41 rows=1001 width=8) (actual time=7.062..15724.571 rows=1001 loops=1)
   ->  HashAggregate  (cost=17.51..27.52 rows=1001 width=8) (actual time=0.268..1.356 rows=1001 loops=1)
         ->  Seq Scan on temp_ids  (cost=0.00..15.01 rows=1001 width=8) (actual time=0.007..0.080 rows=1001 loops=1)
   ->  Index Scan using t_pkey on t  (cost=0.00..11.53 rows=1 width=8) (actual time=15.703..15.705 rows=1 loops=1001)
         Index Cond: (t.id = temp_ids.val)
 Total runtime: 15725.063 ms

-- another way, using a join against the temp table instead of IN
explain analyze select id from t join temp_ids on (t.id = temp_ids.val);

Nested Loop  (cost=0.00..24687.88 rows=2140 width=8) (actual time=22.594..16557.789 rows=1001 loops=1)
   ->  Seq Scan on temp_ids  (cost=0.00..31.40 rows=2140 width=8) (actual time=0.014..0.872 rows=1001 loops=1)
   ->  Index Scan using t_pkey on t  (cost=0.00..11.51 rows=1 width=8) (actual time=16.536..16.537 rows=1 loops=1001)
         Index Cond: (t.id = temp_ids.val)
 Total runtime: 16558.331 ms

The temp-table query ran much faster if re-run, but that is because the id value set is constant, so the target data was fresh in cache and Pg performed no real IO to execute it the second time.

Answer 1 (score: 5):

My somewhat naive tests show that using IN (...) is at least an order of magnitude faster than both a join on a temp table and a join on a common table expression. (Frankly, that surprised me.) I tested 3000 integer values against a table of 300,000 rows.

create table integers (
  n integer primary key
);
insert into integers
select generate_series(0, 300000);

-- External ruby program generates 3000 random integers in the range of 0 to 299999.
-- Used Emacs to massage the output into a SQL statement that looks like

explain analyze
select integers.n 
from integers where n in (
100109,
100354 ,
100524 ,
...
);
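The external-program step above (a ruby script plus Emacs massaging) could equally be sketched in Python; this stand-in just generates the random keys and formats the IN-list literal:

```python
import random

def random_in_statement(n=3000, lo=0, hi=299999):
    """Generate n random integers in [lo, hi] and embed them as a
    literal IN (...) list, mirroring the ruby/Emacs step above.
    (Duplicates are possible, as with plain random draws.)"""
    values = [random.randint(lo, hi) for _ in range(n)]
    body = ", ".join(str(v) for v in values)
    return f"select integers.n from integers where n in ({body});"

stmt = random_in_statement(5)
```

Literal interpolation is safe here only because the values are generated integers, never user input.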

Answer 2 (score: 3):

In reply to @Catcall's post, I couldn't resist testing it myself. Amazing! Counterintuitive. The execution plans are similar (both queries use the implicit index): SELECT ... IN ... (execution plan screenshot) and SELECT ... JOIN ... (execution plan screenshot).

CREATE TABLE integers (
  n integer PRIMARY KEY
);
INSERT INTO integers
SELECT generate_series(0, 300000);

CREATE TABLE search (  n integer );

-- Generate INSERTS and SELECT ... WHERE ... IN (...)
SELECT 'SELECT integers.n 
FROM integers WHERE n IN (' || list || ');',
' INSERT INTO search VALUES '
|| values ||'; ' FROM (
SELECT string_agg( n::text, ',')  AS list, string_agg( '('||n::text||')', ',')  AS values FROM (
SELECT n FROM integers ORDER BY random() LIMIT 3000 ) AS elements ) AS raw


INSERT INTO search VALUES (9155),(189177),(18815),(13027),... ;

EXPLAIN SELECT integers.n 
FROM integers WHERE n IN (9155,189177,18815,13027,...);

EXPLAIN SELECT integers.n FROM integers JOIN search ON integers.n = search.n;