Question

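I have two tables with the same structure:

Table A: content | text

Table B: SubContent | text

Table B contains substrings of the text in table A.

For example:

Table A:

zupta . . for making south africa worse than ever.
zur eda terkenang arwah ibu . . eda terkenang arwah ibu . .

Table B:

making south africa
terkenang arwah ibu

What I want to achieve is: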

making south africa | zupta . . for **making south africa** worse than ever.
terkenang arwah ibu | zur eda **terkenang arwah ibu** . . eda terkenang arwah ibu . .

The query I use is very simple:

select subcontent, content from A join B on  content SIMILAR TO '%'||subcontent||'%'

EXPLAIN output:

Gather  (cost=1000.00..6611032.19 rows=1678309 width=180)
  Workers Planned: 2
  ->  Nested Loop  (cost=0.00..6442201.29 rows=699295 width=180)
"        Join Filter: (A.content ~ similar_escape((('%'::text || B.subcontent) || '%'::text), NULL::text))"
        ->  Parallel Seq Scan on A  (cost=0.00..8675.79 rows=186479 width=91)
        ->  Seq Scan on B  (cost=0.00..19.50 rows=750 width=89)

I tried using large text indexes, following this blog post (great work by Andrew Kane!):

https://medium.com/@ankane/large-text-indexes-in-postgres-5d7f1677f89f

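I tried adding an index on both columns, but it did not help: the query still takes 40 minutes to run. Any ideas?

EXPLAIN output after applying Laurenz Albe's answer: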

Nested Loop  (cost=27.79..100352.93 rows=1678309 width=180)
  ->  Seq Scan on q2_distinct  (cost=0.00..19.50 rows=750 width=89)
  ->  Bitmap Heap Scan on clean_distinct  (cost=27.79..111.40 rows=2238 width=91)
"        Recheck Cond: (A.content ~ similar_escape((('%'::text || q2_distinct.part) || '%'::text), NULL::text))"
        ->  Bitmap Index Scan on "clean_distinct_Post content_idx"  (cost=0.00..27.23 rows=2238 width=0)
"              Index Cond: (A.content ~ similar_escape((('%'::text || B.content) || '%'::text), NULL::text))"

Answer 1

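Any join whose join condition does not use the equality operator (=) can only be executed as a nested loop join, because hash and merge joins both require an equality condition.

Probably the only index that can help is a trigram index on A: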

CREATE EXTENSION pg_trgm;
CREATE INDEX ON "A" USING gin (content gin_trgm_ops);

But you should not expect good execution times from a query like this.
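To check whether the trigram index is actually picked up, something along these lines may help. It is only a sketch: it reuses the table and column names from the question, assumes pg_trgm is installed and the index above exists, and swaps SIMILAR TO for LIKE (which pg_trgm GIN indexes also support) so that characters such as | or ( in subcontent are not interpreted as pattern syntax.

```sql
-- Show the trigrams pg_trgm extracts from one search string;
-- these are what the GIN index is probed with.
SELECT show_trgm('making south africa');

-- Same join as in the question, written with LIKE instead of SIMILAR TO.
-- Caveat: a % or _ inside subcontent is still treated as a wildcard by LIKE.
EXPLAIN
SELECT subcontent, content
FROM B
JOIN A ON content LIKE '%' || subcontent || '%';
```

If the plan shows a Bitmap Index Scan on the trigram index inside the nested loop, the index is being used. Whether the planner chooses it depends on the data and the statistics.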

Answer 2

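Postgres's built-in text search functions can help here. But you will need to add a second column to table A so that it can be indexed. You will also need to know the language of each string in table A.

Example query: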

postgres=# select to_tsvector('english', 'Big cats eat fish on Mondays.') 
               @@ phraseto_tsquery('english', 'eat fish on monday') as query;
 query
-------
 t

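Note that the plural and the capital letter in "Mondays" did not prevent the match.

Use the to_tsvector function to create a parsed form of the content, which can then be indexed with a GIN index.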

postgres=# select to_tsvector('english', 'Big cats eat fish on Mondays.');
               to_tsvector
---------------------------------------------
 'big':1 'cat':2 'eat':3 'fish':4 'monday':6

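Note how to_tsvector normalizes the string (converting to lower case, removing punctuation, reducing plurals to their stem, and dropping low-value words such as "on"). This means that "eat fish if Monday" would also match (because "if" is also a low-value word). So if you only want exact matches, you need to add a second check using the LIKE operator. That check will still run far less often than it would without the tsvector GIN index.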
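The false-positive case described above can be tried directly. This sketch reuses the sample sentence from the answer and compares the text search match with a plain LIKE check on the same strings; the column aliases are made up for illustration.

```sql
-- Compare the normalized phrase match with an exact substring check.
-- "if" is a stop word in the english configuration, just like "on".
SELECT
    to_tsvector('english', 'Big cats eat fish on Mondays.')
        @@ phraseto_tsquery('english', 'eat fish if monday') AS tsquery_match,
    'Big cats eat fish on Mondays.'
        LIKE '%eat fish if monday%'                          AS like_match;
```

According to the reasoning above, the first column should report a match while the second should not, which is why the LIKE condition is kept as a recheck.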

So your query would now look like this:

SELECT subcontent, content 
FROM A 
JOIN B ON (
    A.tsv_content @@ phraseto_tsquery(B.lang, B.subcontent) 
    AND content LIKE '%' || subcontent || '%'
);

NB: phraseto_tsquery requires Postgres 9.6+.
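The query above relies on a tsv_content column that the answer does not define. A minimal sketch of how it might be set up follows. It assumes table A has a lang column of type regconfig holding each row's text search configuration; that column is not part of the question, and a constant such as 'english' or 'simple' could be used instead. The index name is also made up.

```sql
-- Assumption: A.lang is a regconfig column (e.g. 'english' or 'simple').
-- Add the parsed form of content as its own column.
ALTER TABLE A ADD COLUMN tsv_content tsvector;

-- Populate it once for the existing rows.
UPDATE A SET tsv_content = to_tsvector(lang, content);

-- GIN index that the @@ operator can use.
CREATE INDEX a_tsv_content_idx ON A USING gin (tsv_content);

-- Refresh planner statistics after the bulk update.
ANALYZE A;
```

B.lang in the query is likewise assumed to be of type regconfig; a plain text column would need a cast (B.lang::regconfig). The new column also has to be kept in sync when rows in A change, for example with a trigger.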

How to optimize a substring search query in Postgres?