Postgres: why does adding an index slow down regexp queries?

Time: 2019-06-08 11:13:42

Tags: sql postgresql database-performance postgresql-performance postgresql-11

I have a TEXT keyvalues column in Postgres:

select * from test5 limit 5;

 id |                      keyvalues
----+------------------------------------------------------
  1 | ^ first 1 | second 3
  2 | ^ first 1 | second 2 ^ first 2 | second 3
  3 | ^ first 1 | second 2 | second 3
  4 | ^ first 2 | second 3 ^ first 1 | second 2 | second 2
  5 | ^ first 2 | second 3 ^ first 1 | second 3

My queries must exclude the ^ character from the middle of the match, so I am using a regular expression:

explain analyze select count(*) from test5 where keyvalues ~* '\^ first 1[^\^]+second 0';

                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=78383.31..78383.32 rows=1 width=8) (actual time=7332.030..7332.030 rows=1 loops=1)
   ->  Gather  (cost=78383.10..78383.30 rows=2 width=8) (actual time=7332.021..7337.138 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=77383.10..77383.10 rows=1 width=8) (actual time=7328.155..7328.156 rows=1 loops=3)
               ->  Parallel Seq Scan on test5  (cost=0.00..77382.50 rows=238 width=0) (actual time=7328.146..7328.146 rows=0 loops=3)
                     Filter: (keyvalues ~* '\^ first 1[^\^]+second 0'::text)
                     Rows Removed by Filter: 1666668
 Planning Time: 0.068 ms
 Execution Time: 7337.184 ms

The query works (zero rows match), but at more than 7 seconds it is far too slow.

I thought a trigram index would help, but no luck:

create extension if not exists pg_trgm;
create index on test5 using gin (keyvalues gin_trgm_ops);

explain analyze select count(*) from test5 where keyvalues ~* '\^ first 1[^\^]+second 0';

                                                                   QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1484.02..1484.03 rows=1 width=8) (actual time=23734.646..23734.646 rows=1 loops=1)
   ->  Bitmap Heap Scan on test5  (cost=1480.00..1484.01 rows=1 width=0) (actual time=23734.641..23734.641 rows=0 loops=1)
         Recheck Cond: (keyvalues ~* '\^ first 1[^\^]+second 0'::text)
         Rows Removed by Index Recheck: 5000005
         Heap Blocks: exact=47620
         ->  Bitmap Index Scan on test5_keyvalues_idx  (cost=0.00..1480.00 rows=1 width=0) (actual time=1756.158..1756.158 rows=5000005 loops=1)
               Index Cond: (keyvalues ~* '\^ first 1[^\^]+second 0'::text)
 Planning Time: 0.412 ms
 Execution Time: 23734.722 ms

The query with the trigram index is 3x slower! It still returns the correct result (zero rows). I expected the trigram index to figure out immediately that there is no second 0 string anywhere, and to be very fast.

(Motivation: I want to avoid normalizing keyvalues into another table, so I am looking to encode the matching logic in a single TEXT field, using text indexing and regular expressions. The logic works but is too slow, as is JSONB.)

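2 answers:

Answer 0 (score: 2)

As you have seen, this does not work well with trigrams. Trigrams do not match across space boundaries, so if all of your data contains the same words, the index will match every row.

This might make things clearer: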

with data as (select * from (values ('^ first 1 | second 3'), 
                                    ('^ first 1 | second 2 ^ first 2 | second 3'), 
                                    ('^ first 1 | second 2 | second 3'), 
                                    ('^ first 2 | second 3 ^ first 1 | second 2 | second 2'), 
                                    ('^ first 2 | second 3 ^ first 1 | second 3')
                             ) v(keyvalues)
)
select keyvalues, show_trgm(keyvalues) from data;
                      keyvalues                       |                                               show_trgm
------------------------------------------------------+-------------------------------------------------------------------------------------------------------
 ^ first 1 | second 3                                 | {"  1","  3","  f","  s"," 1 "," 3 "," fi"," se",con,eco,fir,irs,"nd ",ond,rst,sec,"st "}
 ^ first 1 | second 2 ^ first 2 | second 3            | {"  1","  2","  3","  f","  s"," 1 "," 2 "," 3 "," fi"," se",con,eco,fir,irs,"nd ",ond,rst,sec,"st "}
 ^ first 1 | second 2 | second 3                      | {"  1","  2","  3","  f","  s"," 1 "," 2 "," 3 "," fi"," se",con,eco,fir,irs,"nd ",ond,rst,sec,"st "}
 ^ first 2 | second 3 ^ first 1 | second 2 | second 2 | {"  1","  2","  3","  f","  s"," 1 "," 2 "," 3 "," fi"," se",con,eco,fir,irs,"nd ",ond,rst,sec,"st "}
 ^ first 2 | second 3 ^ first 1 | second 3            | {"  1","  2","  3","  f","  s"," 1 "," 2 "," 3 "," fi"," se",con,eco,fir,irs,"nd ",ond,rst,sec,"st "}
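The same comparison can be turned around to look at the search side. The following is only a sketch, assuming the pg_trgm extension is installed; it lists the trigrams of the literal words in the pattern next to those of one sample row. By analogy with the listing above, the word-level trigrams for first, 1 and second occur in every row, so the only trigrams that could rule rows out are the ones belonging to the isolated 0.

```sql
-- Sketch only: assumes pg_trgm is available in the current database.
create extension if not exists pg_trgm;

-- Trigrams of the literal words from the search pattern, next to
-- those of one sample row taken from the question.
select show_trgm('first 1')              as first_part_trgm,
       show_trgm('second 0')             as second_part_trgm,
       show_trgm('^ first 1 | second 3') as sample_row_trgm;
```

Note that show_trgm works on plain strings; the trigrams that the index actually derives from a regular expression may be a smaller set.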

Could you use a partial index to exclude the rows that have a ^ in the middle?

Answer 1 (score: 1)

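According to the OP, user @jjanes gave the correct answer on DBA.SE:

"I expected the trigram index to figure out immediately there's no second 0 string anywhere"

"second" and "0" are separate words, so it cannot detect their joint absence. It seems like it could detect the absence of '0', but this comment from "contrib/pg_trgm/trgm_regexp.c" seems relevant:

 * Note: Using again the example "foo bar", we will not consider the
 * trigram " b", though this trigram would be found by the trigram
 * extraction code. Since we will find " ba", it doesn't seem worth
 * trying to hack the algorithm to generate the additional trigram.

Since 0 is the last character in the pattern string, there will be no trigram of the form " 0a" either, so it misses that opportunity.

Even if it were not for this limitation, your approach seems extremely fragile.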