Question

I have a very complex JSONB document stored in a jsonb column.

The table looks like this:

 CREATE TABLE sites (
   id text NOT NULL,
   doc jsonb,
   PRIMARY KEY (id)
 )

The data we store in the doc column is a complex nested JSONB document:

   {
      "_id": "123",
      "type": "Site",
      "identification": "Custom ID",
      "title": "SITE 1",
      "address": "UK, London, Mr Tom's street, 2",
      "buildings": [
          {
               "uuid": "12312",
               "identification": "Custom ID",
               "name": "BUILDING 1",
               "deposits": [
                   {
                      "uuid": "12312",
                      "identification": "Custom ID",             
                      "audits": [
                          {
                             "uuid": "12312",         
                              "sample_id": "SAMPLE ID"                
                          }
                       ]
                   }
               ]
          } 
       ]
    }

So the JSONB structure is as follows:

SITE 
  -> ARRAY OF BUILDINGS
     -> ARRAY OF DEPOSITS
       -> ARRAY OF AUDITS

We need to implement full-text search over certain values in each type of entry:

SITE (identification, title, address)
BUILDING (identification, name)
DEPOSIT (identification)
AUDIT (sample_id)

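The SQL query should run the full-text search only against these field values.

I guess this calls for a GIN index and something like tsvector, but I don't have enough PostgreSQL background.

So my question is: is it possible to index and then query such a nested JSONB structure?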

Answer 1

Let's add a new column of type tsvector:

alter table sites add column tsvector tsvector;

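Now let's create a trigger that collects lexemes, organizes them, and puts them into our tsvector. We will use four groups (A, B, C, D). This is a special tsvector feature (weights) that lets you distinguish lexemes later at search time (see the examples in the manual: https://www.postgresql.org/docs/current/static/textsearch-controls.html). Unfortunately, the feature supports only four groups, because the developers reserved just 2 bits for it, but we are lucky: four groups is exactly what we need: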

create or replace function t_sites_tsvector() returns trigger as $$
declare
  dic regconfig;
  part_a text;
  part_b text;
  part_c text;
  part_d text;
begin
  dic := 'simple'; -- change if you need more advanced word processing (stemming, etc)

  -- Group A: site-level fields.
  part_a := coalesce(new.doc->>'identification', '') || ' ' || coalesce(new.doc->>'title', '') || ' ' || coalesce(new.doc->>'address', '');

  -- Group B: identification and name of every building.
  select into part_b string_agg(coalesce(a, ''), ' ') || ' ' || string_agg(coalesce(b, ''), ' ')
  from (
    select
      jsonb_array_elements((new.doc->'buildings'))->>'identification',
      jsonb_array_elements((new.doc->'buildings'))->>'name'
  ) _(a, b);

  -- Group C: identification of every deposit in every building.
  select into part_c string_agg(coalesce(c, ''), ' ')
  from (
    select jsonb_array_elements(b)->>'identification' from (
      select jsonb_array_elements((new.doc->'buildings'))->'deposits'
    ) _(b)
  ) __(c);

  -- Group D: sample_id of every audit in every deposit.
  select into part_d string_agg(coalesce(d, ''), ' ')
  from (
    select jsonb_array_elements(c)->>'sample_id'
    from (
      select jsonb_array_elements(b)->'audits' from (
        select jsonb_array_elements((new.doc->'buildings'))->'deposits'
      ) _(b)
    ) __(c)
  ) ___(d);

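  -- coalesce guards against NULL parts (e.g. a site with no buildings),
  -- which would otherwise turn the whole concatenation into NULL.
  new.tsvector := setweight(to_tsvector(dic, coalesce(part_a, '')), 'A')
    || setweight(to_tsvector(dic, coalesce(part_b, '')), 'B')
    || setweight(to_tsvector(dic, coalesce(part_c, '')), 'C')
    || setweight(to_tsvector(dic, coalesce(part_d, '')), 'D')
  ;
  return new;
end;
$$ language plpgsql; -- a trigger function should not be declared immutable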

create trigger t_sites_tsvector
  before insert or update on sites for each row execute procedure t_sites_tsvector();
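
The trigger fires only for rows that are inserted or updated after it has been created. If the table already holds data, the existing rows keep an empty tsvector until they are touched. A minimal sketch of a backfill, assuming it is acceptable to rewrite every row once:

```sql
-- A no-op update: doc is unchanged, but the BEFORE UPDATE trigger
-- fires for each row and fills in the tsvector column.
update sites set doc = doc;
```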

Now let's create a GIN index to speed up search queries (this makes sense if you have many rows, say more than a few hundred or a few thousand):

create index i_sites_fulltext on sites using gin(tsvector);

Now let's insert something to check:

insert into sites select 1, '{
      "_id": "123",
      "type": "Site",
      "identification": "Custom ID",
      "title": "SITE 1",
      "address": "UK, London, Mr Tom''s street, 2",
      "buildings": [
          {
               "uuid": "12312",
               "identification": "Custom ID",
               "name": "BUILDING 1",
               "deposits": [
                   {
                      "uuid": "12312",
                      "identification": "Custom ID",
                      "audits": [
                          {
                             "uuid": "12312",
                              "sample_id": "SAMPLE ID"
                          }
                       ]
                   }
               ]
          }
       ]
    }'::jsonb;

Check with select * from sites; you should see that the tsvector column contains some data.

Now let's query it:

select * from sites where tsvector @@ to_tsquery('simple', 'sample');

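- It should return our record. In this case we search for the word 'sample' and do not care which group it is found in.

Let's change that and search only in group A ("SITE (identification, title, address)", as you described it):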

select * from sites where tsvector @@ to_tsquery('simple', 'sample:A');

- This should return nothing, because the word 'sample' sits only in group D ("AUDIT (sample_id)"). Indeed:

select * from sites where tsvector @@ to_tsquery('simple', 'sample:D');

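- returns our record again.

Note that you need to use to_tsquery(..), not plainto_tsquery(..), to address the four groups. So you need to sanitize the input yourself (escape or remove special characters such as & and |, because they have a special meaning in tsquery values).

The good news is that you can combine different groups in a single query, like this: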
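
If sanitizing the input by hand is undesirable, one possible alternative (not part of the approach above) is to keep plainto_tsquery for parsing and restrict the group on the tsvector side with ts_filter, which is available in PostgreSQL 9.6 and later. The sketch below assumes the sites table and trigger from this answer; the string literal stands in for raw user input.

```sql
select s.*
from sites s,
     plainto_tsquery('simple', 'Tom''s street') as q  -- stands in for raw user input
where s.tsvector @@ q                        -- plain match, can use the GIN index
  and ts_filter(s.tsvector, '{a}') @@ q;     -- recheck against group A lexemes only
```

The first condition lets the planner use the index, and the second one narrows the result to rows where the match lies in group A. The ts_filter condition alone would not be able to use the index on the tsvector column.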

select * from sites where tsvector @@ to_tsquery('simple', 'sample:D & london:A');

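Another approach (for example, if you need more than four groups) is to keep multiple tsvectors, each in a separate column: build them with a single query, create an index (you can create a single index on multiple tsvector columns), and write queries that address the individual columns. It is similar to what I explained above, but probably less efficient.

Hope this helps.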
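
A rough sketch of that multi-column variant follows. The column names tsv_site and tsv_building and the index name are hypothetical, and the columns would still have to be filled by a trigger in the same way as the single tsvector column.

```sql
-- Hypothetical column names, one tsvector per group.
alter table sites
  add column tsv_site tsvector,
  add column tsv_building tsvector;

-- One GIN index can cover several tsvector columns.
create index i_sites_fulltext_multi on sites using gin (tsv_site, tsv_building);

-- Each condition addresses its own column, so no weight labels are needed
-- and plainto_tsquery can parse raw user input.
select *
from sites
where tsv_site @@ plainto_tsquery('simple', 'london')
  and tsv_building @@ plainto_tsquery('simple', 'building 1');
```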

Answer 2

In Postgres 10 things seem a bit simpler, because the to_tsvector function supports json. For example, this works fine:

UPDATE dataset SET search_vector = to_tsvector('english',
'{
  "abstract":"Abstract goes here",
  "useConstraints":"None",
  "dataQuality":"Good",
  "Keyword":"historic",
  "topicCategory":"Environment",
  "responsibleOrganisation":"HES"
}'::json)
where dataset_id = 4;

Note that I have not tried this on a deeply nested structure, but I see no reason why it would not work.
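
One caveat: to_tsvector on a json or jsonb value picks up every string value in the document, so for the sites table in the question it would also index fields such as uuid and type. A possible way to limit it to chosen fields, assuming PostgreSQL 12 or later for jsonb_path_query_array, is to extract those fields first. The sketch below does this for the audit sample_id values only.

```sql
-- Assumes PostgreSQL 12+ and the sites table from the question.
select id,
       to_tsvector('simple',
         jsonb_path_query_array(doc, '$.buildings[*].deposits[*].audits[*].sample_id')
       ) as audit_tsv
from sites;
```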

How to implement full-text search on complex nested JSONB in PostgreSQL
