Improving ranking time for searches over multiple JSONB fields in PostgreSQL

Asked: 2016-12-08 15:08:49

Tags: postgresql indexing full-text-search ranking jsonb

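My searches themselves are quite fast, but as soon as I rank the results to get the best matches first, I hit a wall. The more hits a term has, the slower the query gets: for uncommon terms the search takes about 2 ms, while for more common terms it takes 900 ms or more. The example below covers all the structures that occur in my data (simple value, array, nested array).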

CREATE TABLE book (
  id   BIGSERIAL NOT NULL,
  data JSONB     NOT NULL
);

Then I built a function that concatenates the name values of my nested array field 'author':

CREATE OR REPLACE FUNCTION author_function(
  IN  data        JSONB,
  OUT resultNames TSVECTOR
)
  RETURNS TSVECTOR AS $$
DECLARE
  authorRecords   RECORD;
  combinedAuthors JSONB [];
  singleAuthor    JSONB;
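BEGIN
  -- Start from an empty tsvector: NULL || tsvector yields NULL,
  -- so without this the function would always return NULL.
  resultNames := ''::tsvector;
  -- The sample data stores the nested array under the key "author".
  FOR authorRecords IN (SELECT value
                        FROM jsonb_array_elements(data #> '{author}'))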
  LOOP
    combinedAuthors := combinedAuthors || authorRecords.value;
  END LOOP;
  FOREACH singleAuthor IN ARRAY coalesce(combinedAuthors, '{}')
  LOOP
    resultNames := resultNames ||
                   coalesce(to_tsvector('english', singleAuthor ->> 'name'), to_tsvector('english', ''));
  END LOOP;
END; $$
LANGUAGE plpgsql
IMMUTABLE;
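For comparison, the same concatenation can probably be written as a single set-based SQL function instead of two loops. This is only a sketch: the name author_names_tsv is hypothetical and not part of the original setup, and it assumes the array lives under the key "author" as in the sample data below. Because it joins all names into one string before calling to_tsvector, the word positions differ slightly from concatenating one tsvector per name.

```sql
-- Hypothetical alternative to author_function (not from the original post).
-- Collects every author's "name" and builds one tsvector from the joined text.
CREATE OR REPLACE FUNCTION author_names_tsv(data JSONB)
  RETURNS TSVECTOR AS $$
  SELECT to_tsvector('english', coalesce(string_agg(elem ->> 'name', ' '), ''))
  FROM jsonb_array_elements(
         -- Fall back to an empty array when "author" is missing or not an array.
         CASE WHEN jsonb_typeof(data -> 'author') = 'array'
              THEN data -> 'author'
              ELSE '[]'::jsonb
         END) AS elem;
$$
LANGUAGE sql
IMMUTABLE;
```

When the array is missing or empty, string_agg returns NULL over zero rows and the coalesce turns that into an empty tsvector, so callers do not need their own NULL handling.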

I also need a function that combines several fields, so that I can build a single index over all of them:

CREATE OR REPLACE FUNCTION multi_field_function(
  IN data JSONB
)
  RETURNS TSVECTOR AS $$
BEGIN
  RETURN
  coalesce(to_tsvector('english', data ->> 'title'),
           to_tsvector('english', '')) ||
  coalesce(to_tsvector('english', data ->> 'subtitles'),
           to_tsvector('english', '')) ||
  coalesce(author_function(data),
           to_tsvector('english', ''));
END; $$
LANGUAGE plpgsql
IMMUTABLE;

Now I need to build the indexes.

CREATE INDEX book_title_idx
  ON book USING GIN (to_tsvector('english', book.data ->> 'title'));
CREATE INDEX book_subtitle_idx
  ON book USING GIN (to_tsvector('english', book.data ->> 'subtitles'));
CREATE INDEX book_author_idx
  ON book USING GIN (author_function(book.data));
CREATE INDEX book_multi_field_idx
  ON book USING GIN (multi_field_function(book.data));

Finally, I added some test data:

INSERT INTO book (data)
VALUES (CAST('{"title": "Cats",' ||
             '"subtitles": ["Cats", "Dogs"],' ||
             '"author": [{"id": 0, "name": "Cats"}, ' ||
             '           {"id": 1, "name": "Dogs"}]}' AS JSONB));
INSERT INTO book (data)
VALUES (CAST('{"title": "ats",' ||
             '"subtitles": ["Cats", "ogs"],' ||
             '"author": [{"id": 2, "name": "ats"}, ' ||
             '           {"id": 3, "name": "ogs"}]}' AS JSONB));

When I query through my multi_field_function, I get the results ranked the way I want.

EXPLAIN ANALYZE
SELECT *
FROM (
       SELECT
         id,
         data,
         ts_rank(query, 'cat:*') AS score
       FROM
         book,
             multi_field_function(data) query
       WHERE multi_field_function(data) @@ to_tsquery('cat:*')
       ORDER BY score DESC) a
WHERE score > 0
ORDER BY score DESC;

On my real data this produces the following query plan. You can see there that the last step, the ranking, is really slow.

Sort  (cost=7921.72..7927.87 rows=2460 width=143) (actual time=949.644..952.263 rows=16926 loops=1)
  Sort Key: (ts_rank(query.query, '''cat'':*'::tsquery)) DESC
  Sort Method: external merge  Disk: 4376kB
  ->  Nested Loop  (cost=47.31..7783.17 rows=2460 width=143) (actual time=3.750..933.719 rows=16926 loops=1)
        ->  Bitmap Heap Scan on book  (cost=47.06..7690.67 rows=2460 width=1305) (actual time=3.582..11.904 rows=16926 loops=1)
              Recheck Cond: (multi_field_function(data) @@ to_tsquery('cat:*'::text))
              Heap Blocks: exact=3695
              ->  Bitmap Index Scan on book_multi_field_idx  (cost=0.00..46.45 rows=2460 width=0) (actual time=3.128..3.128 rows=16926 loops=1)
                    Index Cond: (multi_field_function(data) @@ to_tsquery('cat:*'::text))
        ->  Function Scan on multi_field_function query  (cost=0.25..0.27 rows=1 width=32) (actual time=0.049..0.049 rows=1 loops=16926)
              Filter: (ts_rank(query, '''cat'':*'::tsquery) > '0'::double precision)
Planning time: 0.163 ms
Execution time: 953.624 ms

Is there a way to keep my JSON structure and still get good, fast search results across multiple fields?

Edit: I had to adjust Vao Tsun's query, because it did not recognize 'query' from the inner FROM.

EXPLAIN ANALYZE
SELECT
  *,
  ts_rank(query, 'cat:*') AS score
FROM (
       SELECT
         id,
         data
       FROM
         book
       WHERE multi_field_function(data) @@ to_tsquery('cat:*')
     ) a,
      multi_field_function(a.data) query
ORDER BY score DESC;

Sadly, the performance did not change much:

Sort  (cost=7880.82..7886.97 rows=2460 width=1343) (actual time=863.542..875.035 rows=16840 loops=1)
  Sort Key: (ts_rank(query.query, '''cat'':*'::tsquery)) DESC
  Sort Method: external merge  Disk: 25280kB
  ->  Nested Loop  (cost=43.31..7742.27 rows=2460 width=1343) (actual time=3.570..821.861 rows=16840 loops=1)
        ->  Bitmap Heap Scan on book  (cost=43.06..7686.67 rows=2460 width=1307) (actual time=3.362..12.085 rows=16840 loops=1)
              Recheck Cond: (multi_field_function(data) @@ to_tsquery('cat:*'::text))
              Heap Blocks: exact=1
              ->  Bitmap Index Scan on book_multi_field_idx  (cost=0.00..42.45 rows=2460 width=0) (actual time=2.934..2.934 rows=16840 loops=1)
                    Index Cond: (multi_field_function(data) @@ to_tsquery('cat:*'::text))
        ->  Function Scan on multi_field_function query  (cost=0.25..0.26 rows=1 width=32) (actual time=0.047..0.047 rows=1 loops=16840)
Planning time: 0.090 ms
Execution time: 879.736 ms
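
In both plans, nearly all of the time sits in the Function Scan on multi_field_function, which runs once per matching row (roughly 0.049 ms × 16926 loops ≈ 830 ms in the first plan). The index finds the rows in a few milliseconds, but the tsvector is rebuilt from the JSONB for every hit just so ts_rank can score it. One possible way around this, given here as an untested sketch rather than a measured fix, is to keep the JSONB column as it is and store the computed tsvector next to it. The names search_vector, book_search_vector_update, book_search_vector_trg and book_search_vector_idx are assumptions made for illustration.

```sql
-- Hypothetical extra column holding the precomputed tsvector; "data" stays unchanged.
ALTER TABLE book ADD COLUMN search_vector TSVECTOR;

-- Backfill existing rows once.
UPDATE book SET search_vector = multi_field_function(data);

-- Index the stored column instead of the function expression.
CREATE INDEX book_search_vector_idx
  ON book USING GIN (search_vector);

-- Keep the column in sync whenever a row is inserted or updated.
CREATE OR REPLACE FUNCTION book_search_vector_update()
  RETURNS TRIGGER AS $$
BEGIN
  NEW.search_vector := multi_field_function(NEW.data);
  RETURN NEW;
END; $$
LANGUAGE plpgsql;

CREATE TRIGGER book_search_vector_trg
  BEFORE INSERT OR UPDATE ON book
  FOR EACH ROW EXECUTE PROCEDURE book_search_vector_update();

-- Rank using the stored value, so the function is not called per matching row.
SELECT
  id,
  data,
  ts_rank(search_vector, to_tsquery('english', 'cat:*')) AS score
FROM book
WHERE search_vector @@ to_tsquery('english', 'cat:*')
ORDER BY score DESC;
```

ts_rank still has to read the stored tsvector of every matching row, so very common terms will remain more expensive than rare ones; whether the saving is large enough would have to be checked with EXPLAIN ANALYZE on the real data.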

0 answers:

No answers