How to optimize this LIKE JOIN query?

Time: 2020-06-20 11:55:55

Tags: sql postgresql postgresql-9.5

This query finds the suffix of a domain:

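SELECT DISTINCT ON ("companyDomain".id)
    "companyDomain".domain,
    "publicSuffix".suffix
FROM
    "companyDomain"
        INNER JOIN "publicSuffix" ON REVERSE("companyDomain".domain) LIKE REVERSE("publicSuffix".suffix) || '%'
ORDER BY "companyDomain".id, LENGTH("publicSuffix".suffix) DESC;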

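Edit: note that this also works for subdomains.

You can play with the example here and visualize the plan with pev. I tried adding covering indexes to the tables, but the query planner ended up not using them. Perhaps there is another query that would be more efficient?

4 answers:

Answer 0: (score: 2)

An index offers no advantage for your data structure and query. Just try to imagine how an index could be used here; I had no luck.

My suggestion is to convert the domains and suffixes to arrays, like this: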

alter table "companyDomain" add column adomain text[];
update "companyDomain" set adomain = string_to_array(domain, '.');
create index idx_adom on "companyDomain" using gin (adomain array_ops);

alter table "publicSuffix" add column asuffix text[];
update "publicSuffix" set asuffix = string_to_array(ltrim(suffix, '.'), '.');
create index idx_asuffix on "publicSuffix" using gin (asuffix array_ops);

Let's compare these queries:

postgres=# explain (analyze, verbose, buffers)
SELECT  DISTINCT ON ("companyDomain".id)
    "companyDomain".domain,
    "publicSuffix".suffix
FROM
    "companyDomain"
        INNER JOIN "publicSuffix" ON REVERSE("companyDomain".domain) LIKE REVERSE("publicSuffix".suffix) || '%'
ORDER BY "companyDomain".id, LENGTH("publicSuffix".suffix) DESC;
┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                                   QUERY PLAN                                                                   │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Unique  (cost=185738.35..185940.72 rows=908 width=31) (actual time=2364.720..2364.890 rows=908 loops=1)                                        │
│   Output: "companyDomain".domain, "publicSuffix".suffix, "companyDomain".id, (length(("publicSuffix".suffix)::text))                           │
│   Buffers: shared hit=306                                                                                                                      │
│   ->  Sort  (cost=185738.35..185839.53 rows=40474 width=31) (actual time=2364.719..2364.764 rows=1006 loops=1)                                 │
│         Output: "companyDomain".domain, "publicSuffix".suffix, "companyDomain".id, (length(("publicSuffix".suffix)::text))                     │
│         Sort Key: "companyDomain".id, (length(("publicSuffix".suffix)::text)) DESC                                                             │
│         Sort Method: quicksort  Memory: 103kB                                                                                                  │
│         Buffers: shared hit=306                                                                                                                │
│         ->  Nested Loop  (cost=0.00..182641.13 rows=40474 width=31) (actual time=22.735..2364.484 rows=1006 loops=1)                           │
│               Output: "companyDomain".domain, "publicSuffix".suffix, "companyDomain".id, length(("publicSuffix".suffix)::text)                 │
│               Join Filter: (reverse(("companyDomain".domain)::text) ~~ (reverse(("publicSuffix".suffix)::text) || '%'::text))                  │
│               Rows Removed by Join Filter: 8093814                                                                                             │
│               Buffers: shared hit=306                                                                                                          │
│               ->  Seq Scan on public."publicSuffix"  (cost=0.00..377.15 rows=8915 width=12) (actual time=0.081..0.794 rows=8915 loops=1)       │
│                     Output: "publicSuffix".id, "publicSuffix".suffix, "publicSuffix".created_at, "publicSuffix".asuffix                        │
│                     Buffers: shared hit=288                                                                                                    │
│               ->  Materialize  (cost=0.00..31.62 rows=908 width=15) (actual time=0.001..0.036 rows=908 loops=8915)                             │
│                     Output: "companyDomain".domain, "companyDomain".id                                                                         │
│                     Buffers: shared hit=18                                                                                                     │
│                     ->  Seq Scan on public."companyDomain"  (cost=0.00..27.08 rows=908 width=15) (actual time=11.576..11.799 rows=908 loops=1) │
│                           Output: "companyDomain".domain, "companyDomain".id                                                                   │
│                           Buffers: shared hit=18                                                                                               │
│ Planning Time: 0.167 ms                                                                                                                        │
│ JIT:                                                                                                                                           │
│   Functions: 9                                                                                                                                 │
│   Options: Inlining false, Optimization false, Expressions true, Deforming true                                                                │
│   Timing: Generation 1.956 ms, Inlining 0.000 ms, Optimization 0.507 ms, Emission 10.878 ms, Total 13.341 ms                                   │
│ Execution Time: 2366.971 ms                                                                                                                    │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

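As far as I can see, the bottleneck here is Rows Removed by Join Filter: 8093814

It seems that PostgreSQL builds the Cartesian join of the two tables and then filters it using the ON condition: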

select count(*) from "companyDomain", "publicSuffix";
---
8094820
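
The numbers in the plan line up with that reading. As a quick check that uses only the row counts reported above (8915 suffixes, 908 domains, 1006 rows surviving the join filter), the arithmetic can be reproduced directly:

```sql
-- Row counts are taken from the EXPLAIN output above.
SELECT 8915 * 908        AS candidate_pairs,    -- every suffix paired with every domain
       8915 * 908 - 1006 AS removed_by_filter;  -- pairs rejected by the LIKE condition
-- candidate_pairs = 8094820, removed_by_filter = 8093814
```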

To fix this, try using an array operator:

postgres=# explain (analyze, verbose, buffers)
SELECT  DISTINCT ON ("companyDomain".id)
    "companyDomain".domain,
    "publicSuffix".suffix
FROM
    "companyDomain"
        INNER JOIN "publicSuffix" ON adomain @> asuffix
ORDER BY "companyDomain".id, LENGTH("publicSuffix".suffix) DESC;
┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                                 QUERY PLAN                                                                  │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Unique  (cost=8310.60..8512.97 rows=908 width=31) (actual time=180.149..180.335 rows=908 loops=1)                                           │
│   Output: "companyDomain".domain, "publicSuffix".suffix, "companyDomain".id, (length(("publicSuffix".suffix)::text))                        │
│   Buffers: shared hit=48986                                                                                                                 │
│   ->  Sort  (cost=8310.60..8411.78 rows=40474 width=31) (actual time=180.148..180.200 rows=1239 loops=1)                                    │
│         Output: "companyDomain".domain, "publicSuffix".suffix, "companyDomain".id, (length(("publicSuffix".suffix)::text))                  │
│         Sort Key: "companyDomain".id, (length(("publicSuffix".suffix)::text)) DESC                                                          │
│         Sort Method: quicksort  Memory: 145kB                                                                                               │
│         Buffers: shared hit=48986                                                                                                           │
│         ->  Nested Loop  (cost=0.59..5213.39 rows=40474 width=31) (actual time=0.190..179.693 rows=1239 loops=1)                            │
│               Output: "companyDomain".domain, "publicSuffix".suffix, "companyDomain".id, length(("publicSuffix".suffix)::text)              │
│               Buffers: shared hit=48986                                                                                                     │
│               ->  Seq Scan on public."companyDomain"  (cost=0.00..27.08 rows=908 width=57) (actual time=0.015..0.098 rows=908 loops=1)      │
│                     Output: "companyDomain".id, "companyDomain".domain, "companyDomain".created_at, "companyDomain".adomain                 │
│                     Buffers: shared hit=18                                                                                                  │
│               ->  Bitmap Heap Scan on public."publicSuffix"  (cost=0.59..5.15 rows=45 width=54) (actual time=0.052..0.197 rows=1 loops=908) │
│                     Output: "publicSuffix".id, "publicSuffix".suffix, "publicSuffix".created_at, "publicSuffix".asuffix                     │
│                     Recheck Cond: ("companyDomain".adomain @> "publicSuffix".asuffix)                                                       │
│                     Rows Removed by Index Recheck: 572                                                                                      │
│                     Heap Blocks: exact=41510                                                                                                │
│                     Buffers: shared hit=48968                                                                                               │
│                     ->  Bitmap Index Scan on idx_asuffix  (cost=0.00..0.58 rows=45 width=0) (actual time=0.039..0.039 rows=573 loops=908)   │
│                           Index Cond: ("publicSuffix".asuffix <@ "companyDomain".adomain)                                                   │
│                           Buffers: shared hit=7458                                                                                          │
│ Planning Time: 0.189 ms                                                                                                                     │
│ Execution Time: 180.434 ms                                                                                                                  │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

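This may be less accurate, because @> tests containment and ignores element order (for example, aaa.bbb is treated the same as bbb.aaa), but you can fix that in the WHERE clause. Either way, it will be faster.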
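
A minimal sketch of that fix, assuming the adomain and asuffix columns created above and that every suffix is stored with its leading dot (as the ltrim call suggests). The array condition stays as the indexable prefilter, and the original suffix test is rechecked on the rows that survive it. The aliases cd and ps are only for brevity, and whether the planner keeps the same plan shape should be confirmed with EXPLAIN:

```sql
SELECT DISTINCT ON (cd.id)
    cd.domain,
    ps.suffix
FROM "companyDomain" cd
JOIN "publicSuffix" ps
  ON cd.adomain @> ps.asuffix            -- coarse, GIN-indexable prefilter
WHERE cd.domain LIKE '%' || ps.suffix    -- exact recheck: domain really ends with the suffix
ORDER BY cd.id, LENGTH(ps.suffix) DESC;
```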

The old domain and suffix columns are now redundant, because you can restore them from adomain/asuffix with the array_to_string(anyarray, text [, text]) function.
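
As a small illustration of that round trip (the literal values are made up for the example): because the update above strips the leading dot from suffix before splitting, the dot has to be prepended again when rebuilding it. Both comparisons below should evaluate to true.

```sql
-- Round trip for a domain: split on '.', then join with '.'.
SELECT array_to_string(string_to_array('www.example.co.uk', '.'), '.')
       = 'www.example.co.uk' AS domain_ok;

-- Round trip for a suffix: ltrim removed the leading '.', so add it back.
SELECT '.' || array_to_string(string_to_array(ltrim('.co.uk', '.'), '.'), '.')
       = '.co.uk' AS suffix_ok;

-- The same expressions applied to the columns from this answer.
SELECT array_to_string(adomain, '.')        AS domain FROM "companyDomain";
SELECT '.' || array_to_string(asuffix, '.') AS suffix FROM "publicSuffix";
```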

Alternatively, to avoid changing the table structure, you can create a functional index on string_to_array() and then use that expression in the filter/join.
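
A sketch of that alternative, assuming the same table and column names as in the question; the index name idx_suffix_parts is hypothetical. The join condition has to repeat the indexed expression exactly for the planner to consider the index:

```sql
-- Expression (functional) GIN index; no extra column is needed.
CREATE INDEX idx_suffix_parts ON "publicSuffix"
    USING gin ((string_to_array(ltrim(suffix, '.'), '.')));

SELECT DISTINCT ON (cd.id)
    cd.domain,
    ps.suffix
FROM "companyDomain" cd
JOIN "publicSuffix" ps
  -- same expression as in the index definition, so the index is a candidate
  ON string_to_array(ltrim(ps.suffix, '.'), '.') <@ string_to_array(cd.domain, '.')
ORDER BY cd.id, LENGTH(ps.suffix) DESC;
```

As with the array columns, this is a containment test, so an exact suffix recheck may still be needed.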

Answer 1: (score: 2)

Have you considered using a gin index?

I made the following modifications to your example DDL:

CREATE EXTENSION IF NOT EXISTS pg_trgm;
...
CREATE INDEX companyDomain_domain_reverse ON "companyDomain" USING gin (REVERSE(domain) gin_trgm_ops);
...
CREATE INDEX publicSuffix_suffix_reverse ON "publicSuffix" USING gin (REVERSE(suffix) gin_trgm_ops);

Here is the query plan:

+--------------------------------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN                                                                                                                                              |
+--------------------------------------------------------------------------------------------------------------------------------------------------------+
|Unique  (cost=40802.07..41004.44 rows=908 width=31) (actual time=98.229..98.356 rows=908 loops=1)                                                       |
|  ->  Sort  (cost=40802.07..40903.26 rows=40474 width=31) (actual time=98.228..98.264 rows=1006 loops=1)                                                |
|        Sort Key: "companyDomain".id, (length(("publicSuffix".suffix)::text)) DESC                                                                      |
|        Sort Method: quicksort  Memory: 103kB                                                                                                           |
|        ->  Nested Loop  (cost=0.05..37704.86 rows=40474 width=31) (actual time=1.655..97.976 rows=1006 loops=1)                                        |
|              ->  Seq Scan on "publicSuffix"  (cost=0.00..151.15 rows=8915 width=12) (actual time=0.011..0.728 rows=8915 loops=1)                       |
|              ->  Bitmap Heap Scan on "companyDomain"  (cost=0.05..4.15 rows=5 width=15) (actual time=0.010..0.010 rows=0 loops=8915)                   |
|                    Recheck Cond: (reverse((domain)::text) ~~ (reverse(("publicSuffix".suffix)::text) || '%'::text))                                    |
|                    Rows Removed by Index Recheck: 0                                                                                                    |
|                    Heap Blocks: exact=301                                                                                                              |
|                    ->  Bitmap Index Scan on companydomain_domain_reverse  (cost=0.00..0.05 rows=5 width=0) (actual time=0.010..0.010 rows=0 loops=8915)|
|                          Index Cond: (reverse((domain)::text) ~~ (reverse(("publicSuffix".suffix)::text) || '%'::text))                                |
|Planning Time: 0.150 ms                                                                                                                                 |
|Execution Time: 98.439 ms                                                                                                                               |
+--------------------------------------------------------------------------------------------------------------------------------------------------------+

As a bonus, you don't even need to REVERSE() the text in the index and in the query:

create index companydomain_domain
    on "companyDomain" using gin(domain gin_trgm_ops);



SELECT DISTINCT ON ("companyDomain".id) "companyDomain".domain, "publicSuffix".suffix
FROM "companyDomain"
         INNER JOIN "publicSuffix" ON "companyDomain".domain LIKE '%' || "publicSuffix".suffix
ORDER BY "companyDomain".id, LENGTH("publicSuffix".suffix) DESC

The query takes about the same time and still uses the gin index:

+------------------------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN                                                                                                                                      |
+------------------------------------------------------------------------------------------------------------------------------------------------+
|Unique  (cost=40556.91..40759.28 rows=908 width=31) (actual time=96.170..96.315 rows=908 loops=1)                                               |
|  ->  Sort  (cost=40556.91..40658.10 rows=40474 width=31) (actual time=96.169..96.209 rows=1006 loops=1)                                        |
|        Sort Key: "companyDomain".id, (length(("publicSuffix".suffix)::text)) DESC                                                              |
|        Sort Method: quicksort  Memory: 103kB                                                                                                   |
|        ->  Nested Loop  (cost=0.05..37459.70 rows=40474 width=31) (actual time=1.764..95.919 rows=1006 loops=1)                                |
|              ->  Seq Scan on "publicSuffix"  (cost=0.00..151.15 rows=8915 width=12) (actual time=0.009..0.711 rows=8915 loops=1)               |
|              ->  Bitmap Heap Scan on "companyDomain"  (cost=0.05..4.12 rows=5 width=15) (actual time=0.010..0.010 rows=0 loops=8915)           |
|                    Recheck Cond: ((domain)::text ~~ ('%'::text || ("publicSuffix".suffix)::text))                                              |
|                    Rows Removed by Index Recheck: 0                                                                                            |
|                    Heap Blocks: exact=301                                                                                                      |
|                    ->  Bitmap Index Scan on companydomain_domain  (cost=0.00..0.05 rows=5 width=0) (actual time=0.010..0.010 rows=0 loops=8915)|
|                          Index Cond: ((domain)::text ~~ ('%'::text || ("publicSuffix".suffix)::text))                                          |
|Planning Time: 0.132 ms                                                                                                                         |
|Execution Time: 96.393 ms                                                                                                                       |
+------------------------------------------------------------------------------------------------------------------------------------------------+

PS: I guess you only need one index, in this case companyDomain_domain_reverse.
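
One way to check that guess is to look at the index usage counters after running the query a few times. This is a sketch under the assumption that statistics collection is enabled (the default); the index names are lower-case because they were created without quotes:

```sql
-- idx_scan stays at 0 for an index the planner has never chosen.
SELECT relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE relname IN ('companyDomain', 'publicSuffix')
ORDER BY relname, indexrelname;

-- If publicsuffix_suffix_reverse turns out to be unused, it can be dropped.
DROP INDEX IF EXISTS publicsuffix_suffix_reverse;
```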

Answer 2: (score: 1)

You want a match like this:

'something.google.com' like '%google.com'

But you know that PostgreSQL won't use an index for this, because the pattern string starts with a wildcard. So you reversed both strings:

'moc.elgoog.gnihtemos' like 'moc.elgoog%'

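and created a function index on REVERSE("companyDomain".domain).

This is a good idea, but PostgreSQL doesn't use your index. That is because the DBMS doesn't know what your strings contain (they are table data, and the DBMS won't read the whole table before making a plan). In the worst case, all reversed suffixes start with '%'. If the DBMS decided to go through the index in that case, it could be extremely slow. You know that the suffixes don't end with '%', but the DBMS doesn't, so it settles for the safe plan (full table scans).

This is documented here: https://www.postgresql.org/docs/9.2/indexes-types.html

The optimizer can also use a B-tree index for queries involving the pattern matching operators LIKE and ~ if the pattern is a constant ...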
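
To illustrate the distinction the documentation draws, here is a hedged sketch. The index name idx_companydomain_rev is hypothetical, and on a table this small the planner may still prefer a sequential scan even in the first case. With a constant, left-anchored pattern a B-tree index on the reversed value is a candidate; with a pattern built from another table's column it is not:

```sql
-- text_pattern_ops lets the B-tree serve LIKE prefixes regardless of locale.
CREATE INDEX idx_companydomain_rev
    ON "companyDomain" (REVERSE(domain) text_pattern_ops);

-- Constant pattern anchored at the start: eligible for an index range scan.
EXPLAIN
SELECT id, domain
FROM "companyDomain"
WHERE REVERSE(domain) LIKE 'moc.elgoog.%';

-- Pattern computed per row of "publicSuffix": not a constant at plan time.
EXPLAIN
SELECT cd.id, cd.domain, ps.suffix
FROM "companyDomain" cd
JOIN "publicSuffix" ps
  ON REVERSE(cd.domain) LIKE REVERSE(ps.suffix) || '%';
```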

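I found no way to convince PostgreSQL that using the index is safe. Adding AND REVERSE("publicSuffix".suffix) || '%' NOT LIKE '/%%' ESCAPE '/' didn't help either.

I think the best bet is to use indexes on RIGHT(domain, 3) and RIGHT(suffix, 3), because we know that a suffix, including its dot, is at least three characters long. That narrows the candidate matches down enough to be useful.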

CREATE INDEX idx_publicSuffix_suffix3 ON "publicSuffix"(RIGHT(suffix, 3) varchar_pattern_ops, suffix);

CREATE INDEX idx_companyDomain_domain3 ON "companyDomain"(RIGHT(domain, 3) varchar_pattern_ops, id, domain);

SELECT DISTINCT ON (cd.id)
  cd.domain,
  ps.suffix
FROM "companyDomain" cd
JOIN "publicSuffix" ps ON cd.domain LIKE '%' || ps.suffix
                       AND RIGHT(cd.domain, 3) = RIGHT(ps.suffix, 3)
ORDER BY cd.id, LENGTH(ps.suffix) DESC;

Demo: https://www.db-fiddle.com/f/dPpVFWjpVJHYFnVut4k7wS/1

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
¦                                                                   QUERY PLAN                                                                                                     ¦
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
¦ Unique  (cost=1684.72..1685.71 rows=198 width=72) (actual time=165.676..165.882 rows=908 loops=1)                                                                                ¦
¦     Buffers: shared hit=4079                                                                                                                                                     ¦
¦     ->  Sort  (cost=1684.72..1685.22 rows=198 width=72) (actual time=165.675..165.723 rows=1006 loops=1)                                                                         ¦
¦           Sort Key: cd.id, (length((ps.suffix)::text)) DESC                                                                                                                      ¦
¦           Sort Method: quicksort Memory: 103kB                                                                                                                                   ¦
¦           Buffers: shared hit=4079                                                                                                                                               ¦
¦           ->  Merge Join  (cost=0.56..1677.17 rows=198 width=72) (actual time=0.090..165.222 rows=1006 loops=1)                                                                  ¦
¦                 Buffers: shared hit=4076                                                                                                                                         ¦
¦                 ->  Index Only Scan using idx_companydomain_domain3 on companyDomain cd  (cost=0.28..93.23 rows=1130 width=36) (actual time=0.018..0.429 rows=908 loops=1)       ¦
¦                       Heap Fetches: 908                                                                                                                                          ¦
¦                       Buffers: shared hit=109                                                                                                                                    ¦
¦                 ->  Materialize  (cost=0.28..602.89 rows=7006 width=32) (actual time=0.019..47.510 rows=390620 loops=1)                                                          ¦
¦                       Buffers: shared hit=3967                                                                                                                                   ¦
¦                       ->  Index Only Scan using idx_publicsuffix_suffix3 on publicSuffix ps  (cost=0.28..585.37 rows=7006 width=32) (actual time=0.015..2.798 rows=8354 loops=1) ¦
¦                             Heap Fetches: 8354                                                                                                                                   ¦
¦                             Buffers: shared hit=3967                                                                                                                             ¦
¦ Planning time: 0.471 ms                                                                                                                                                          ¦
¦ Execution time: 166.054 ms                                                                                                                                                       ¦
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Answer 3: (score: 0)

How about:

SELECT 
  DISTINCT ON ("companyDomain".id) "companyDomain".domain, 
  "publicSuffix".suffix 
FROM 
  "companyDomain" 
  INNER JOIN "publicSuffix" ON RIGHT(
    domain, 
    - POSITION('.' IN domain) + 1
  ) = "publicSuffix".suffix 
ORDER BY 
  "companyDomain".id, 
  LENGTH("publicSuffix".suffix) DESC;

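We get the position of the first . in the domain, then pass the negative of that value (+1 so that the first . is kept) to RIGHT. With a negative length, RIGHT returns everything except that many leading characters, which leaves the text from the first . onward as the suffix.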
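
A worked example of that expression on literal values (the sample domains are illustrative only). Note that for a name with more labels the result is everything after the first label, so the equality join matches only if that whole string is itself stored as a suffix:

```sql
-- 'something' is 9 characters, so the first '.' is at position 10
-- and RIGHT(..., -10 + 1) drops the first 9 characters.
SELECT RIGHT('something.google.com',
             -POSITION('.' IN 'something.google.com') + 1) AS tail;
-- tail = '.google.com'

-- The first '.' is at position 4, so the first 3 characters are dropped.
SELECT RIGHT('www.example.co.uk',
             -POSITION('.' IN 'www.example.co.uk') + 1) AS tail;
-- tail = '.example.co.uk'
```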

It looks like it runs much faster, going from 2500 ms to 120 ms.

Live test