PostgreSQL: index to optimize xmlexists on XML containing arrays

Time: 2018-02-05 12:09:45

Tags: postgresql xpath

I am porting my application from MS SQL to PostgreSQL 10.1 and I am stuck on handling XML fields. I changed "exists()" to "xmlexists()" in my queries, so a typical query now looks like:

SELECT t."id", t."fullname"
FROM "candidates" t
WHERE xmlexists('//assignments/assignment/project_id[.=''6512779208625374885'']'
           PASSING BY REF t.assignments );

Assume the "assignments" column contains XML data with the following structure:

<assignments>
<assignment>
    <project_id>6512779208625374885</project_id>
    <start_date>2018-02-05T14:30:06+00:00</start_date>
    <state_id>1</state_id>
</assignment>
<assignment>
    <project_id>7512979208625374996</project_id>
    <start_date>2017-12-01T15:30:00+00:00</start_date>
    <state_id>0</state_id>
</assignment>
<assignment>
    <project_id>5522979707625370402</project_id>
    <start_date>2017-12-15T10:00:00+00:00</start_date>
    <state_id>1</state_id>
</assignment>
</assignments>

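The question is how to build an efficient index for this type of query. I understand that there is no generic XPath index like the one in MS SQL, so I need to build a specific one. But all the examples I managed to find (e.g. Postgresql 9.x: Index to optimize `xpath_exists` (XMLEXISTS) queries) deal with nested fields, not arrays.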

P.S. I tried switching from XML to JSONB, but that would require rewriting a lot of queries to use jsonb_array_elements() with joins, which I would like to avoid.

1 answer:

Answer 0: (score: 2)

You can take advantage of the fact that xpath() returns an array.

The following expression:

xpath('/assignments/assignment/project_id/text()', assignments)::text[]

returns an array of strings containing all the project IDs. This expression can be indexed:

create index on candidates using gin ((xpath('/assignments/assignment/project_id/text()', assignments)::text[]));
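To see what the indexed expression evaluates to, it can be run against an inline document without touching the table. This is a minimal sketch; the two-assignment sample document is shortened from the one in the question and is only there for illustration.

```sql
-- Standalone check: extract the project IDs from an inline sample document.
-- The cast turns the xml[] returned by xpath() into text[].
select xpath('/assignments/assignment/project_id/text()',
             '<assignments>
                <assignment><project_id>6512779208625374885</project_id></assignment>
                <assignment><project_id>7512979208625374996</project_id></assignment>
              </assignments>'::xml)::text[] as project_ids;
-- project_ids: {6512779208625374885,7512979208625374996}
```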

The following query can use that index:

select *
from candidates
where xpath('/assignments/assignment/project_id/text()', assignments)::text[] @> array['6512779208625374885'];

@> is the "contains" operator for arrays, which GIN indexes support.
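Applied to the query from the question, the rewrite might look like the sketch below. It assumes the GIN expression index shown earlier exists. The XPath string and the cast have to match the indexed expression exactly, so the leading `//` from the original query is replaced by the single `/` used in the index.

```sql
-- Sketch: the question's xmlexists() filter expressed with the indexed expression.
SELECT t."id", t."fullname"
FROM "candidates" t
WHERE xpath('/assignments/assignment/project_id/text()', t.assignments)::text[]
      @> array['6512779208625374885'];
```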

You can use it to check for multiple IDs with a single condition:

select *
from candidates
where xpath('/assignments/assignment/project_id/text()', assignments)::text[] @> array['6512779208625374885', '6512779208625374886'];

The above returns rows whose XML contains both project_ids.

If you use the "overlaps" operator &&, you can also search for rows that contain any of the elements:

select *
from candidates
where xpath('/assignments/assignment/project_id/text()', assignments)::text[] && array['6512779208625374885', '6512779208625374886'];

The above returns rows whose XML contains at least one of the project_ids.
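The difference between the two operators is easy to check on literal arrays, independent of any table or index. The values here are arbitrary placeholders:

```sql
-- @> requires every right-hand element to be present; && requires at least one.
select array['a','b','c'] @> array['a','b'] as contains_both,   -- true
       array['a','b','c'] @> array['a','x'] as contains_mixed,  -- false: 'x' is missing
       array['a','b','c'] && array['a','x'] as overlaps_mixed,  -- true: 'a' is shared
       array['a','b','c'] && array['x','y'] as overlaps_none;   -- false: nothing shared
```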

For more details on the array operators, see the manual.

The drawback is that GIN indexes are larger and more expensive to maintain than BTree indexes.
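One way to gauge that cost on your own data is to compare the on-disk size of the index with that of the table. The sketch below assumes the index is named candidates_xpath_idx (the name that appears in the plan further down) and that both relations are on the current search_path; adjust the names if yours differ.

```sql
-- Sketch: compare the GIN index size with the table size (including TOAST).
select pg_size_pretty(pg_relation_size('candidates_xpath_idx')) as index_size,
       pg_size_pretty(pg_table_size('candidates'))              as table_size;
```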

I verified this with the following test setup:

create table candidates
(
  id integer,
  assignments  xml
);

insert into candidates
select i, format('<assignments>
                    <assignment>
                        <project_id>%s</project_id>
                        <start_date>2018-02-05T14:30:06+00:00</start_date>
                        <state_id>1</state_id>
                    </assignment>
                    <assignment>
                        <project_id>%s</project_id>
                        <start_date>2017-12-01T15:30:00+00:00</start_date>
                        <state_id>0</state_id>
                    </assignment>
                    <assignment>
                        <project_id>%s</project_id>
                        <start_date>2017-12-15T10:00:00+00:00</start_date>
                        <state_id>1</state_id>
                    </assignment></assignments>', i, 10000000 + i, 20000000 + i)::xml
from generate_series(1,1000000) as i;

So the table candidates now contains one million rows, each with 3 different project_ids.
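The setup above does not repeat the index creation, so presumably the GIN index from earlier was built on this table before running the query. A sketch of that step follows, with the index named explicitly to match the candidates_xpath_idx that shows up in the plan. The `analyze` call is an addition of mine, and with fresh statistics the row estimates may differ from the ones in the plan shown.

```sql
-- Sketch: build the expression index on the test table, then refresh statistics.
create index candidates_xpath_idx on candidates
  using gin ((xpath('/assignments/assignment/project_id/text()', assignments)::text[]));

analyze candidates;
```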

explain (analyze, buffers)
select *
from candidates
where xpath('/assignments/assignment/project_id/text()', assignments)::text[] @> array['10000047'];

shows the following plan:

QUERY PLAN                                                                                                                                            
------------------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on test.candidates  (cost=29.25..6604.48 rows=5000 width=473) (actual time=0.032..0.032 rows=1 loops=1)                             
  Output: id, assignments                                                                                                                             
  Recheck Cond: ((xpath('/assignments/assignment/project_id/text()'::text, candidates.assignments, '{}'::text[]))::text[] @> '{10000047}'::text[])    
  Heap Blocks: exact=1                                                                                                                                
  Buffers: shared hit=5                                                                                                                               
  ->  Bitmap Index Scan on candidates_xpath_idx  (cost=0.00..28.00 rows=5000 width=0) (actual time=0.028..0.028 rows=1 loops=1)                       
        Index Cond: ((xpath('/assignments/assignment/project_id/text()'::text, candidates.assignments, '{}'::text[]))::text[] @> '{10000047}'::text[])
        Buffers: shared hit=4                                                                                                                         
Planning time: 0.162 ms                                                                                                                               
Execution time: 0.078 ms                                                                                                                              

Searching one million XML documents in less than a tenth of a millisecond doesn't seem too bad.
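For comparison, the same lookup can be run with the xmlexists() form from the question. This is only a sketch and I have not measured it here: since that predicate does not match the indexed expression, I would expect the planner to fall back to a sequential scan over all rows.

```sql
-- Sketch: the same lookup written with xmlexists(), which the GIN index cannot serve.
explain (analyze, buffers)
select *
from candidates
where xmlexists('/assignments/assignment/project_id[.=''10000047'']'
                passing by ref assignments);
```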