Full-text search over large documents

Time: 2019-06-11 14:58:40

Tags: postgresql

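I'm running into performance problems and am looking for a way to optimize full-text search against a materialized view that contains resumes. Any best practices would be greatly appreciated.

My query looks like this: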

select
  r.id,
  r.first_name,
  r.last_name
from user.candidates_mv r
where r.ts_resume @@ to_tsquery('simple', 'nurse | doctor')

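This takes about 1 or 2 minutes to run; re-running it drops to about 1.5 seconds. I can't really rely on caching, because the tsquery is supplied by the end user doing the search and is usually new.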

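  • There are about 2,000,000 records in the candidates table
  • Each resume is about 10,000 characters long
  • The ts_resume field is a simple tsvector with a GIN index
  • tsqueries can be quite long, made up of many ORs, ANDs, and words that must appear next to each other (e.g. 'specialist <-> nurse')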
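To make the operators in the last bullet concrete, here is a minimal sketch of how `|`, `&`, and `<->` behave under the 'simple' configuration. The sample strings are invented for illustration and do not come from the actual resume data.

```sql
-- Adjacent words in the given order: the phrase operator matches.
SELECT to_tsvector('simple', 'senior specialist nurse')
       @@ to_tsquery('simple', 'specialist <-> nurse') AS adjacent_in_order;  -- t

-- Same words in reverse order: <-> is order-sensitive, so no match.
SELECT to_tsvector('simple', 'nurse specialist')
       @@ to_tsquery('simple', 'specialist <-> nurse') AS reversed_order;     -- f

-- OR and AND combine plain lexemes without regard to position.
SELECT to_tsvector('simple', 'senior specialist nurse')
       @@ to_tsquery('simple', '(doctor | nurse) & senior') AS or_and;        -- t
```

Only `<->` depends on word positions, and a GIN index on a tsvector stores lexemes without positions, so phrase conditions have to be rechecked against the stored tsvector.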

Example of a long tsquery:

'((mechanical <-> design) | (mechanical <-> engineering) | (mechanical <-> engineer) | (development <-> engineer) | (senior <-> design) | (principal <-> design) | (principle <-> design) | (lead <-> design) | (chief <-> design) | (senior <-> mechanical) | (principal <-> mechanical) | (principle <-> mechanical) | (chief <-> mechanical) | (lead <-> mechanical) | (product <-> design) | (senior <-> product) | (principal <-> product) | (principle <-> product) | (lead <-> product) | (chief <-> product)) & ((solidworks) | (solid <-> works)) & ((special <-> purpose <-> machinery) | (special <-> purpose <-> machine) | (special <-> purpose <-> machines) | (high <-> speed <-> machinery) | (high <-> speed <-> machines) | (high <-> speed <-> machines) | (high-speed <-> machine) | (high-speed <-> machines) | (high-speed <-> machinery) | (automated <-> machinery) | (automated <-> machine) | (automated <-> machinery) | (automated <-> machine) | (automated <-> machines) | (automation <-> machinery) | (automation <-> solution) | (automation <-> solutions) | (automated <-> solution) | (automated <-> solutions) | (conveyor <-> system) | (conveyor <-> systems) | (motors) | (servo) | (gearbox) | (gearboxes) | (drive <-> system) | (drive <-> systems) | (test <-> rig) | (test-rig) | (test <-> rigs) | (test-rigs))'

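Taken together, this seems like a lot of work for the database to narrow things down to roughly 10,000 results.

Any help, alternative strategies, or pointers on where to add extra clauses would be welcome.

I can also add RAM / HDD and update any relevant pg_settings. Current specs: 8GB RAM and 100GB of disk space, with settings chosen according to pg_tune.

Edit: added the EXPLAIN output for the query. It is for reference only, since I had already run this search and the data was cached.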
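For anyone wanting to compare configurations, the current values can be read from the `pg_settings` view. This is only a sketch: the particular settings listed are an assumption about which ones are most likely to matter for a memory-bound index scan, not a list taken from the question.

```sql
-- Show the current values of a few memory- and I/O-related settings.
-- The choice of settings here is illustrative, not exhaustive.
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('shared_buffers',
               'effective_cache_size',
               'work_mem',
               'maintenance_work_mem',
               'random_page_cost')
ORDER BY name;
```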

Bitmap Heap Scan on entities.candidates_mv c  (cost=10000001184.01..10000001191.96 rows=2 width=8) (actual time=187.902..233.306 rows=2656 loops=1)
  Output: c.id
  Recheck Cond: (c.ts_description @@ '( ''design'' <-> ''engineer'' | ''mechanical'' <-> ''design'' | ''mechanical'' <-> ''engineering'' | ''mechanical'' <-> ''engineer'' | ''development'' <-> ''engineer'' | ''senior'' <-> ''design'' | ''principal'' <-> ''design'' | ''principle'' <-> ''design'' | ''lead'' <-> ''design'' | ''chief'' <-> ''design'' | ''senior'' <-> ''mechanical'' | ''principal'' <-> ''mechanical'' | ''principle'' <-> ''mechanical'' | ''chief'' <-> ''mechanical'' | ''lead'' <-> ''mechanical'' | ''product'' <-> ''design'' | ''senior'' <-> ''product'' | ''principal'' <-> ''product'' | ''principle'' <-> ''product'' | ''lead'' <-> ''product'' | ''chief'' <-> ''product'' ) & ( ''solidworks'' | ''solid'' <-> ''works'' ) & ( ''injection'' | ''injected'' ) & ( ''mold'' | ''mould'' | ''molds'' | ''moulds'' | ''molding'' | ''moulding'' | ''moldings'' | ''mouldings'' | ''moldable'' | ''mouldable'' | ''molded'' | ''moulded'' )'::tsquery)
  Rows Removed by Index Recheck: 288
  Heap Blocks: exact=2471
  Buffers: shared hit=18560
  ->  Bitmap Index Scan on candidates_mv_ts_description_idx  (cost=0.00..1184.01 rows=2 width=0) (actual time=186.334..186.334 rows=2944 loops=1)
    Index Cond: (c.ts_description @@ '( ''design'' <-> ''engineer'' | ''mechanical'' <-> ''design'' | ''mechanical'' <-> ''engineering'' | ''mechanical'' <-> ''engineer'' | ''development'' <-> ''engineer'' | ''senior'' <-> ''design'' | ''principal'' <-> ''design'' | ''principle'' <-> ''design'' | ''lead'' <-> ''design'' | ''chief'' <-> ''design'' | ''senior'' <-> ''mechanical'' | ''principal'' <-> ''mechanical'' | ''principle'' <-> ''mechanical'' | ''chief'' <-> ''mechanical'' | ''lead'' <-> ''mechanical'' | ''product'' <-> ''design'' | ''senior'' <-> ''product'' | ''principal'' <-> ''product'' | ''principle'' <-> ''product'' | ''lead'' <-> ''product'' | ''chief'' <-> ''product'' ) & ( ''solidworks'' | ''solid'' <-> ''works'' ) & ( ''injection'' | ''injected'' ) & ( ''mold'' | ''mould'' | ''molds'' | ''moulds'' | ''molding'' | ''moulding'' | ''moldings'' | ''mouldings'' | ''moldable'' | ''mouldable'' | ''molded'' | ''moulded'' )'::tsquery)
    Buffers: shared hit=2379
Planning time: 0.300 ms
Execution time: 233.996 ms
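A plan in this form can be produced with a statement along the following lines. The option list is an assumption inferred from the `Output:` and `Buffers:` lines above, the 'simple' configuration is carried over from the first query, and the tsquery is shortened for readability.

```sql
-- ANALYZE runs the query and reports actual timings and row counts,
-- BUFFERS adds the "Buffers:" lines, VERBOSE adds the "Output:" lines.
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT c.id
FROM entities.candidates_mv c
WHERE c.ts_description @@ to_tsquery(
        'simple',
        '(solidworks | (solid <-> works)) & (injection | injected)');
```

In the plan above every buffer is reported as `shared hit`, meaning the pages were already in PostgreSQL's shared buffers. A cold run would show `read=` counts as well, which is where the slow first execution would become visible.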

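1 answer:

Answer 0: (score: 0)

You could try the RUM index extension: https://github.com/postgrespro/rum

A RUM index is similar to GIN, but it also stores positional information, which helps with phrase searches.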
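A minimal sketch of what trying this might look like, assuming the rum extension has been built and installed on the server (it is not part of core PostgreSQL). The index name is hypothetical, and the schema qualifier is left off the materialized view for brevity.

```sql
-- Requires the rum extension to be available on the server.
CREATE EXTENSION IF NOT EXISTS rum;

-- rum_tsvector_ops is the operator class for tsvector columns;
-- the index name below is made up for this example.
CREATE INDEX candidates_mv_ts_resume_rum_idx
    ON candidates_mv
    USING rum (ts_resume rum_tsvector_ops);

-- The query itself is unchanged; the planner can use the new index
-- for the same @@ operator, including phrase (<->) conditions.
SELECT r.id, r.first_name, r.last_name
FROM candidates_mv r
WHERE r.ts_resume @@ to_tsquery('simple', 'specialist <-> nurse');
```

Because the index stores positions as well as lexemes, it will generally be larger and slower to build than the GIN index, so it is worth measuring both on a copy of the data before switching.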