This query finds the public suffix of each domain:
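SELECT DISTINCT ON ("companyDomain".id)
"companyDomain".domain,
"publicSuffix".suffix
FROM
"companyDomain"
INNER JOIN "publicSuffix" ON REVERSE("companyDomain".domain) LIKE REVERSE("publicSuffix".suffix) || '%'
ORDER BY "companyDomain".id, LENGTH("publicSuffix".suffix) DESC;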
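Edit: note that this also works for subdomains.
You can play with the example here and visualize the plan with pev. I tried adding covering indexes to the tables, but the query planner ended up not using them. Is there perhaps another query that would be more efficient?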
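Answer 0 (score: 2)
Indexes give no advantage for your data structure and query as they stand. Just try to imagine how an index could be used here; I had no luck.
My suggestion is to convert the domains and suffixes to arrays, like this: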
alter table "companyDomain" add column adomain text[];
update "companyDomain" set adomain = string_to_array(domain, '.');
create index idx_adom on "companyDomain" using gin (adomain array_ops);
alter table "publicSuffix" add column asuffix text[];
update "publicSuffix" set asuffix = string_to_array(ltrim(suffix, '.'), '.');
create index idx_asuffix on "publicSuffix" using gin (asuffix array_ops);
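To see what the new columns hold, here is a minimal sketch on literal values. The sample strings are made up for illustration, and the suffix is assumed to be stored with a leading dot, as the ltrim() call above suggests.

```sql
-- Literal values only; no tables are needed to run this.
SELECT string_to_array('www.example.co.uk', '.')          AS adomain,   -- {www,example,co,uk}
       string_to_array(ltrim('.co.uk', '.'), '.')         AS asuffix,   -- {co,uk}
       string_to_array('www.example.co.uk', '.')
         @> string_to_array(ltrim('.co.uk', '.'), '.')    AS contains;  -- true
```

The @> operator is the one a GIN index on an array column can serve, which is why the labels are split into array elements.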
Let's compare the queries:
postgres=# explain (analyze, verbose, buffers)
SELECT DISTINCT ON ("companyDomain".id)
"companyDomain".domain,
"publicSuffix".suffix
FROM
"companyDomain"
INNER JOIN "publicSuffix" ON REVERSE("companyDomain".domain) LIKE REVERSE("publicSuffix".suffix) || '%'
ORDER BY "companyDomain".id, LENGTH("publicSuffix".suffix) DESC;
┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Unique (cost=185738.35..185940.72 rows=908 width=31) (actual time=2364.720..2364.890 rows=908 loops=1) │
│ Output: "companyDomain".domain, "publicSuffix".suffix, "companyDomain".id, (length(("publicSuffix".suffix)::text)) │
│ Buffers: shared hit=306 │
│ -> Sort (cost=185738.35..185839.53 rows=40474 width=31) (actual time=2364.719..2364.764 rows=1006 loops=1) │
│ Output: "companyDomain".domain, "publicSuffix".suffix, "companyDomain".id, (length(("publicSuffix".suffix)::text)) │
│ Sort Key: "companyDomain".id, (length(("publicSuffix".suffix)::text)) DESC │
│ Sort Method: quicksort Memory: 103kB │
│ Buffers: shared hit=306 │
│ -> Nested Loop (cost=0.00..182641.13 rows=40474 width=31) (actual time=22.735..2364.484 rows=1006 loops=1) │
│ Output: "companyDomain".domain, "publicSuffix".suffix, "companyDomain".id, length(("publicSuffix".suffix)::text) │
│ Join Filter: (reverse(("companyDomain".domain)::text) ~~ (reverse(("publicSuffix".suffix)::text) || '%'::text)) │
│ Rows Removed by Join Filter: 8093814 │
│ Buffers: shared hit=306 │
│ -> Seq Scan on public."publicSuffix" (cost=0.00..377.15 rows=8915 width=12) (actual time=0.081..0.794 rows=8915 loops=1) │
│ Output: "publicSuffix".id, "publicSuffix".suffix, "publicSuffix".created_at, "publicSuffix".asuffix │
│ Buffers: shared hit=288 │
│ -> Materialize (cost=0.00..31.62 rows=908 width=15) (actual time=0.001..0.036 rows=908 loops=8915) │
│ Output: "companyDomain".domain, "companyDomain".id │
│ Buffers: shared hit=18 │
│ -> Seq Scan on public."companyDomain" (cost=0.00..27.08 rows=908 width=15) (actual time=11.576..11.799 rows=908 loops=1) │
│ Output: "companyDomain".domain, "companyDomain".id │
│ Buffers: shared hit=18 │
│ Planning Time: 0.167 ms │
│ JIT: │
│ Functions: 9 │
│ Options: Inlining false, Optimization false, Expressions true, Deforming true │
│ Timing: Generation 1.956 ms, Inlining 0.000 ms, Optimization 0.507 ms, Emission 10.878 ms, Total 13.341 ms │
│ Execution Time: 2366.971 ms │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
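As far as I can tell, the bottleneck here is Rows Removed by Join Filter: 8093814.
It seems PostgreSQL builds the Cartesian product of the two tables and then filters it with the ON
condition: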
select count(*) from "companyDomain", "publicSuffix";
---
8094820
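The figures in the plan line up with that reading. A quick arithmetic check, using only the row counts reported above (908 domains, 8915 suffixes, 1006 rows surviving the join filter):

```sql
SELECT 908 * 8915        AS candidate_pairs,  -- 8094820, the count(*) above
       908 * 8915 - 1006 AS rejected_pairs;   -- 8093814, "Rows Removed by Join Filter"
```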
To fix this, try the array containment operator:
postgres=# explain (analyze, verbose, buffers)
SELECT DISTINCT ON ("companyDomain".id)
"companyDomain".domain,
"publicSuffix".suffix
FROM
"companyDomain"
INNER JOIN "publicSuffix" ON adomain @> asuffix
ORDER BY "companyDomain".id, LENGTH("publicSuffix".suffix) DESC;
┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Unique (cost=8310.60..8512.97 rows=908 width=31) (actual time=180.149..180.335 rows=908 loops=1) │
│ Output: "companyDomain".domain, "publicSuffix".suffix, "companyDomain".id, (length(("publicSuffix".suffix)::text)) │
│ Buffers: shared hit=48986 │
│ -> Sort (cost=8310.60..8411.78 rows=40474 width=31) (actual time=180.148..180.200 rows=1239 loops=1) │
│ Output: "companyDomain".domain, "publicSuffix".suffix, "companyDomain".id, (length(("publicSuffix".suffix)::text)) │
│ Sort Key: "companyDomain".id, (length(("publicSuffix".suffix)::text)) DESC │
│ Sort Method: quicksort Memory: 145kB │
│ Buffers: shared hit=48986 │
│ -> Nested Loop (cost=0.59..5213.39 rows=40474 width=31) (actual time=0.190..179.693 rows=1239 loops=1) │
│ Output: "companyDomain".domain, "publicSuffix".suffix, "companyDomain".id, length(("publicSuffix".suffix)::text) │
│ Buffers: shared hit=48986 │
│ -> Seq Scan on public."companyDomain" (cost=0.00..27.08 rows=908 width=57) (actual time=0.015..0.098 rows=908 loops=1) │
│ Output: "companyDomain".id, "companyDomain".domain, "companyDomain".created_at, "companyDomain".adomain │
│ Buffers: shared hit=18 │
│ -> Bitmap Heap Scan on public."publicSuffix" (cost=0.59..5.15 rows=45 width=54) (actual time=0.052..0.197 rows=1 loops=908) │
│ Output: "publicSuffix".id, "publicSuffix".suffix, "publicSuffix".created_at, "publicSuffix".asuffix │
│ Recheck Cond: ("companyDomain".adomain @> "publicSuffix".asuffix) │
│ Rows Removed by Index Recheck: 572 │
│ Heap Blocks: exact=41510 │
│ Buffers: shared hit=48968 │
│ -> Bitmap Index Scan on idx_asuffix (cost=0.00..0.58 rows=45 width=0) (actual time=0.039..0.039 rows=573 loops=908) │
│ Index Cond: ("publicSuffix".asuffix <@ "companyDomain".adomain) │
│ Buffers: shared hit=7458 │
│ Planning Time: 0.189 ms │
│ Execution Time: 180.434 ms │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
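It may be less accurate, because @> ignores element order (for example, aaa.bbb
is treated the same as bbb.aaa
), which is why this plan returns 1239 joined rows instead of 1006. You can fix that in the WHERE
clause. Either way, it will be faster.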
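One possible way to do that recheck is sketched below. It is an assumption about what the WHERE fix could look like, not something taken from the answer: the array condition narrows the candidates, and a LIKE on the original text columns rejects the out-of-order matches. The aliases cd and ps are introduced here for brevity.

```sql
-- Containment ignores element order, so reversed labels still match.
SELECT string_to_array('bbb.aaa', '.') @> string_to_array('aaa.bbb', '.') AS false_positive;  -- true

-- Sketch: keep the GIN-indexable @> condition in the join and
-- recheck the real suffix relationship on the text columns.
SELECT DISTINCT ON (cd.id)
       cd.domain,
       ps.suffix
FROM "companyDomain" cd
JOIN "publicSuffix" ps ON cd.adomain @> ps.asuffix
WHERE cd.domain LIKE '%' || ps.suffix
ORDER BY cd.id, LENGTH(ps.suffix) DESC;
```

Whether the planner still picks the GIN index with the extra filter should be confirmed with EXPLAIN on the real data.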
Now the old domain
and suffix
columns are redundant, because you can restore them from adomain/asuffix
with the array_to_string(anyarray, text [, text])
function.
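A small sketch of that round trip on literal values. It assumes the suffixes were originally stored with a leading dot, which ltrim() removed during the conversion, so the dot has to be put back by hand.

```sql
SELECT array_to_string(string_to_array('www.example.co.uk', '.'), '.') AS domain,  -- www.example.co.uk
       '.' || array_to_string(string_to_array('co.uk', '.'), '.')      AS suffix;  -- .co.uk
```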
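Alternatively, to avoid changing the table structure, you can create a functional (expression) index on string_to_array()
and then use the same expression in the filter or join.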
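A rough sketch of that alternative follows. The index names are hypothetical, and the join must repeat exactly the indexed expressions for the planner to consider the indexes.

```sql
-- Hypothetical index names; no new columns are added.
CREATE INDEX idx_adom_expr ON "companyDomain"
    USING gin (string_to_array(domain, '.'));
CREATE INDEX idx_asuffix_expr ON "publicSuffix"
    USING gin (string_to_array(ltrim(suffix, '.'), '.'));

-- The join condition uses the same expressions as the index definitions.
SELECT DISTINCT ON (cd.id)
       cd.domain,
       ps.suffix
FROM "companyDomain" cd
JOIN "publicSuffix" ps
  ON string_to_array(cd.domain, '.') @> string_to_array(ltrim(ps.suffix, '.'), '.')
ORDER BY cd.id, LENGTH(ps.suffix) DESC;
```

The same order-insensitivity caveat applies here as with the stored array columns.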
Answer 1 (score: 2)
Have you considered using a gin
index?
I made the following modifications to your example DML:
CREATE EXTENSION IF NOT EXISTS pg_trgm;
...
CREATE INDEX companyDomain_domain_reverse ON "companyDomain" USING gin (REVERSE(domain) gin_trgm_ops);
...
CREATE INDEX publicSuffix_suffix_reverse ON "publicSuffix" USING gin (REVERSE(suffix) gin_trgm_ops);
Here is the query plan:
+--------------------------------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN |
+--------------------------------------------------------------------------------------------------------------------------------------------------------+
|Unique (cost=40802.07..41004.44 rows=908 width=31) (actual time=98.229..98.356 rows=908 loops=1) |
| -> Sort (cost=40802.07..40903.26 rows=40474 width=31) (actual time=98.228..98.264 rows=1006 loops=1) |
| Sort Key: "companyDomain".id, (length(("publicSuffix".suffix)::text)) DESC |
| Sort Method: quicksort Memory: 103kB |
| -> Nested Loop (cost=0.05..37704.86 rows=40474 width=31) (actual time=1.655..97.976 rows=1006 loops=1) |
| -> Seq Scan on "publicSuffix" (cost=0.00..151.15 rows=8915 width=12) (actual time=0.011..0.728 rows=8915 loops=1) |
| -> Bitmap Heap Scan on "companyDomain" (cost=0.05..4.15 rows=5 width=15) (actual time=0.010..0.010 rows=0 loops=8915) |
| Recheck Cond: (reverse((domain)::text) ~~ (reverse(("publicSuffix".suffix)::text) || '%'::text)) |
| Rows Removed by Index Recheck: 0 |
| Heap Blocks: exact=301 |
| -> Bitmap Index Scan on companydomain_domain_reverse (cost=0.00..0.05 rows=5 width=0) (actual time=0.010..0.010 rows=0 loops=8915)|
| Index Cond: (reverse((domain)::text) ~~ (reverse(("publicSuffix".suffix)::text) || '%'::text)) |
|Planning Time: 0.150 ms |
|Execution Time: 98.439 ms |
+--------------------------------------------------------------------------------------------------------------------------------------------------------+
As a bonus, you don't even need to REVERSE()
the text in the index or in the query:
create index companydomain_domain
on "companyDomain" using gin(domain gin_trgm_ops);
SELECT DISTINCT ON ("companyDomain".id) "companyDomain".domain, "publicSuffix".suffix
FROM "companyDomain"
INNER JOIN "publicSuffix" ON "companyDomain".domain LIKE '%' || "publicSuffix".suffix
ORDER BY "companyDomain".id, LENGTH("publicSuffix".suffix) DESC
The query takes about the same time and still uses the gin index:
+------------------------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN |
+------------------------------------------------------------------------------------------------------------------------------------------------+
|Unique (cost=40556.91..40759.28 rows=908 width=31) (actual time=96.170..96.315 rows=908 loops=1) |
| -> Sort (cost=40556.91..40658.10 rows=40474 width=31) (actual time=96.169..96.209 rows=1006 loops=1) |
| Sort Key: "companyDomain".id, (length(("publicSuffix".suffix)::text)) DESC |
| Sort Method: quicksort Memory: 103kB |
| -> Nested Loop (cost=0.05..37459.70 rows=40474 width=31) (actual time=1.764..95.919 rows=1006 loops=1) |
| -> Seq Scan on "publicSuffix" (cost=0.00..151.15 rows=8915 width=12) (actual time=0.009..0.711 rows=8915 loops=1) |
| -> Bitmap Heap Scan on "companyDomain" (cost=0.05..4.12 rows=5 width=15) (actual time=0.010..0.010 rows=0 loops=8915) |
| Recheck Cond: ((domain)::text ~~ ('%'::text || ("publicSuffix".suffix)::text)) |
| Rows Removed by Index Recheck: 0 |
| Heap Blocks: exact=301 |
| -> Bitmap Index Scan on companydomain_domain (cost=0.00..0.05 rows=5 width=0) (actual time=0.010..0.010 rows=0 loops=8915)|
| Index Cond: ((domain)::text ~~ ('%'::text || ("publicSuffix".suffix)::text)) |
|Planning Time: 0.132 ms |
|Execution Time: 96.393 ms |
+------------------------------------------------------------------------------------------------------------------------------------------------+
PS: I guess you only need one index, in this case companyDomain_domain_reverse
Answer 2 (score: 1)
You want a match like this:
'something.google.com' like '%google.com'
But you know that PostgreSQL won't use an index for this, because the pattern string starts with a wildcard. So you reversed both strings:
'moc.elgoog.gnihtemos' like 'moc.elgoog%'
and created a function index on REVERSE("companyDomain".domain)
.
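This is a good idea, but PostgreSQL doesn't use your index. That is because the DBMS doesn't know what your strings contain (this is table data, and the DBMS won't read the whole table before making a plan). In the worst case, all reversed suffixes start with '%'
. If the DBMS decided to go through the index in that case, it could be extremely slow. You know that the suffixes don't end with '%'
, but the DBMS doesn't, so it settles on the safe plan (full table scans).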
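This is documented here: https://www.postgresql.org/docs/9.2/indexes-types.html
The optimizer can also use a B-tree index for queries involving the pattern matching operators LIKE and ~ if the pattern is a constant ...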
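The difference between a constant pattern and a pattern taken from another table can be explored with a sketch like the one below. The index name is hypothetical, and the plans you get depend on table size and statistics, so no output is shown.

```sql
-- Hypothetical index name; text_pattern_ops lets LIKE use the b-tree in non-C locales.
CREATE INDEX idx_companydomain_rev
    ON "companyDomain" (REVERSE(domain) text_pattern_ops);

-- Constant, left-anchored pattern: the planner can derive an index range condition.
EXPLAIN
SELECT id, domain
FROM "companyDomain"
WHERE REVERSE(domain) LIKE 'moc.elgoog%';

-- Pattern computed from another table's column: its value is unknown at plan time.
EXPLAIN
SELECT cd.id, ps.suffix
FROM "companyDomain" cd
JOIN "publicSuffix" ps
  ON REVERSE(cd.domain) LIKE REVERSE(ps.suffix) || '%';
```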
I found no way to convince PostgreSQL that using the index is safe. Adding AND REVERSE("publicSuffix".suffix) || '%' NOT LIKE '/%%' ESCAPE '/'
didn't help.
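I think your best bet is to use indexes on RIGHT(domain, 3)
and RIGHT(suffix, 3)
, because we know that a suffix, including the dot, should contain at least three characters. This narrows the candidate matches down enough to be useful.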
CREATE INDEX idx_publicSuffix_suffix3 ON "publicSuffix"(RIGHT(suffix, 3) varchar_pattern_ops, suffix);
CREATE INDEX idx_companyDomain_domain3 ON "companyDomain"(RIGHT(domain, 3) varchar_pattern_ops, id, domain);
SELECT DISTINCT ON (cd.id)
cd.domain,
ps.suffix
FROM "companyDomain" cd
JOIN "publicSuffix" ps ON cd.domain LIKE '%' || ps.suffix
AND RIGHT(cd.domain, 3) = RIGHT(ps.suffix, 3)
ORDER BY cd.id, LENGTH(ps.suffix) DESC;
Demo: https://www.db-fiddle.com/f/dPpVFWjpVJHYFnVut4k7wS/1
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| QUERY PLAN |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Unique (cost=1684.72..1685.71 rows=198 width=72) (actual time=165.676..165.882 rows=908 loops=1) |
| Buffers: shared hit=4079 |
| -> Sort (cost=1684.72..1685.22 rows=198 width=72) (actual time=165.675..165.723 rows=1006 loops=1) |
| Sort Key: cd.id, (length((ps.suffix)::text)) DESC |
| Sort Method: quicksort Memory: 103kB |
| Buffers: shared hit=4079 |
| -> Merge Join (cost=0.56..1677.17 rows=198 width=72) (actual time=0.090..165.222 rows=1006 loops=1) |
| Buffers: shared hit=4076 |
| -> Index Only Scan using idx_companydomain_domain3 on companyDomain cd (cost=0.28..93.23 rows=1130 width=36) (actual time=0.018..0.429 rows=908 loops=1) |
| Heap Fetches: 908 |
| Buffers: shared hit=109 |
| -> Materialize (cost=0.28..602.89 rows=7006 width=32) (actual time=0.019..47.510 rows=390620 loops=1) |
| Buffers: shared hit=3967 |
| -> Index Only Scan using idx_publicsuffix_suffix3 on publicSuffix ps (cost=0.28..585.37 rows=7006 width=32) (actual time=0.015..2.798 rows=8354 loops=1) |
| Heap Fetches: 8354 |
| Buffers: shared hit=3967 |
| Planning time: 0.471 ms |
| Execution time: 166.054 ms |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
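To illustrate why the three-character tail works as a join key, here is a small sketch on literal values. The sample strings are made up, and the suffixes are assumed to carry a leading dot.

```sql
SELECT RIGHT('something.google.com', 3)           AS domain_tail,  -- com
       RIGHT('.com', 3)                           AS suffix_tail,  -- com
       RIGHT('bbc.co.uk', 3) = RIGHT('.co.uk', 3) AS same_bucket;  -- true
```

The equality on the tails gives the planner an ordinary equi-join (the Merge Join in the plan above), and the LIKE condition then only has to check pairs that share a tail.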
Answer 3 (score: 0)
How about:
SELECT
DISTINCT ON ("companyDomain".id) "companyDomain".domain,
"publicSuffix".suffix
FROM
"companyDomain"
INNER JOIN "publicSuffix" ON RIGHT(
domain,
- POSITION('.' IN domain) + 1
) = "publicSuffix".suffix
ORDER BY
"companyDomain".id,
LENGTH("publicSuffix".suffix) DESC;
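We get the position of the first .
in the domain, then pass the negative of that value (+1 to keep the first .
) to RIGHT
, which returns everything except that many leading characters, that is, the tail starting at the first dot.
It appears to run much faster, going from 2500ms to 120ms.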
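A worked example of that expression on literal values (the sample domains are made up for illustration):

```sql
SELECT POSITION('.' IN 'example.co.uk')                                        AS first_dot,  -- 8
       RIGHT('example.co.uk', -POSITION('.' IN 'example.co.uk') + 1)           AS tail_one,   -- .co.uk
       RIGHT('www.example.co.uk', -POSITION('.' IN 'www.example.co.uk') + 1)   AS tail_two;   -- .example.co.uk
```

As the second tail shows, the expression always cuts at the first dot, so a name with extra labels in front matches only if that longer tail is itself present in "publicSuffix".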