Is there a way to use the pg_trgm LIKE operator with btree indexes in PostgreSQL?

Asked: 2018-06-26 14:48:32

Tags: postgresql indexing pg-trgm

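I have two tables:

  • table_1 with about 1 million rows, with columns id_t1: integer, c1_t1: varchar, etc.
  • table_2 with about 50 million rows, with columns id_t2: integer, ref_id_t1: integer, c1_t2: varchar, etc.

ref_id_t1 is filled with id_t1 values, but the two are not linked by a foreign key because table_2 does not know about table_1.

I need to run a query across both tables, like the following: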

SELECT * FROM table_1 t1 WHERE t1.c1_t1= 'A' AND t1.id_t1 IN
(SELECT t2.ref_id_t1 FROM table_2 t2 WHERE t2.c1_t2 LIKE '%abc%');

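Without any changes, or with only basic indexes, the query takes about a minute to complete because of a sequential scan on table_2. To prevent this, I created a GIN index with the gin_trgm_ops operator class: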

CREATE EXTENSION pg_trgm;
CREATE INDEX c1_t2_gin_index ON table_2 USING gin (c1_t2 gin_trgm_ops);

But this does not solve the problem, because the inner query still takes a long time.

EXPLAIN ANALYSE SELECT t2.ref_id_t1 FROM table_2 t2 WHERE t2.c1_t2 LIKE '%abc%'

gives the following:

Bitmap Heap Scan on table_2 t2 (cost=664.20..189671.00 rows=65058 width=4) (actual time=5101.286..22854.838 rows=69631 loops=1)
  Recheck Cond: ((c1_t2 )::text ~~ '%1.1%'::text)
  Rows Removed by Index Recheck: 49069703
  Heap Blocks: exact=611548
  ->  Bitmap Index Scan on gin_trg  (cost=0.00..647.94 rows=65058 width=0) (actual time=4911.125..4911.125 rows=49139334 loops=1)
        Index Cond: ((c1_t2)::text ~~ '%1.1%'::text)
Planning time: 0.529 ms
Execution time: 22863.017 ms

The Bitmap Index Scan is fast, but because we need t2.ref_id_t1, PostgreSQL has to perform a Bitmap Heap Scan, which is not fast on 65,000 rows.
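One detail in the plan above may be worth checking: the Bitmap Index Scan reports 49,139,334 candidate rows, and the recheck removes 49,069,703 of them, so the trigram lookup looks very unselective for this particular pattern. The pg_trgm documentation states that non-word characters (non-alphanumerics) are ignored when trigrams are extracted, which could matter for a literal such as '1.1'. A minimal sketch for inspecting this, using show_trgm() from pg_trgm; note that it works on plain strings rather than LIKE patterns, so it only approximates what the index lookup uses:

```sql
-- Sketch only: list the trigrams pg_trgm derives from a plain string.
-- show_trgm() ships with the pg_trgm extension.
SELECT show_trgm('1.1') AS trigrams_for_plan_literal;   -- literal seen in the plan
SELECT show_trgm('abc') AS trigrams_for_query_literal;  -- literal seen in the query text
```

If the first call yields only a few very common trigrams, nearly every row would be a candidate and the heap recheck would dominate the run time.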

The way to avoid the Bitmap Heap Scan would be to get an Index Only Scan. This is possible with a multicolumn btree index, see https://www.postgresql.org/docs/9.6/static/indexes-index-only-scans.html
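As the linked documentation page explains, an Index Only Scan can skip the heap only for pages marked all-visible in the visibility map. A rough sketch for checking that precondition on table_2, assuming a manual VACUUM is acceptable on a table of this size:

```sql
-- Sketch only: refresh the visibility map and planner statistics,
-- then compare all-visible pages with total pages for table_2.
VACUUM (ANALYZE) table_2;

SELECT relpages, relallvisible
FROM pg_class
WHERE relname = 'table_2';
```

The closer relallvisible is to relpages, the fewer heap fetches an Index Only Scan should need.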

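If I change the query to match on the beginning of c1_t2, and create a btree index on c1_t2 and ref_id_t1, the query takes just over a second, even though the inner query returns 90,000 rows.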

CREATE INDEX c1_t2_ref_id_t1_index
    ON table_2  USING btree
    (c1_t2 varchar_pattern_ops ASC NULLS LAST, ref_id_t1 ASC NULLS LAST);


EXPLAIN ANALYSE SELECT * FROM table_1 t1 WHERE t1.c1_t1= 'A' AND t1.id_t1 IN
    (SELECT t2.ref_id_t1 FROM table_2 t2 WHERE t2.c1_t2 LIKE 'aaa%');

Hash Join  (cost=56561.99..105233.96 rows=1 width=2522) (actual time=953.647..1068.488 rows=36 loops=1)
  Hash Cond: (t1.id_t1 = t2.ref_id_t1)
  ->  Seq Scan on table_1 t1  (cost=0.00..48669.65 rows=615 width=2522) (actual time=0.088..667.576 rows=790 loops=1)
        Filter: (c1_t1 = 'A')
        Rows Removed by Filter: 1083798
  ->  Hash  (cost=56553.74..56553.74 rows=660 width=4) (actual time=400.657..400.657 rows=69632 loops=1)
        Buckets: 131072 (originally 1024)  Batches: 1 (originally 1)  Memory Usage: 3472kB
        ->  HashAggregate  (cost=56547.14..56553.74 rows=660 width=4) (actual time=380.280..391.871 rows=69632 loops=1)
              Group Key: t2.ref_id_t1
              ->  Index Only Scan using c1_t2_ref_id_t1_index on table_2 t2   (cost=0.56..53907.28 rows=1055943 width=4) (actual time=0.014..202.034 rows=974737 loops=1)
                    Index Cond: ((c1_t2  ~>=~ 'aaa'::text) AND (c1_t2  ~<~ 'chb'::text))
                    Filter: ((c1_t2 )::text ~~ 'aaa%'::text)
                    Heap Fetches: 0
Planning time: 1.512 ms
Execution time: 1069.712 ms

But this is not possible with GIN indexes, because these indexes do not store all the data in their keys.
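This can be checked directly on PostgreSQL 9.6 or later with pg_index_column_has_property(), whose 'returnable' property says whether an index-only scan can return the column value. A minimal sketch, assuming the two indexes exist under the names used in the CREATE INDEX statements above:

```sql
-- Sketch only: ask PostgreSQL whether each index can return the value
-- of its first column (c1_t2) to an index-only scan.
SELECT
    pg_index_column_has_property('c1_t2_gin_index'::regclass, 1, 'returnable')
        AS gin_returnable,
    pg_index_column_has_property('c1_t2_ref_id_t1_index'::regclass, 1, 'returnable')
        AS btree_returnable;
```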

Is there a way to use something like the pg_trgm extension with a btree index, so that we can get an Index Only Scan for LIKE '%abc%' queries?

0 answers:

No answers yet