I am porting my application from MS SQL to PostgreSQL 10.1 and I am stuck on handling an XML field. I changed "exists()" to "xmlexists()" in my queries, so a typical query now looks like this:
SELECT t."id", t."fullname"
FROM "candidates" t
WHERE xmlexists('//assignments/assignment/project_id[.=''6512779208625374885'']'
PASSING BY REF t.assignments );
Assume the "assignments" column contains XML data with the following structure:
<assignments>
<assignment>
<project_id>6512779208625374885</project_id>
<start_date>2018-02-05T14:30:06+00:00</start_date>
<state_id>1</state_id>
</assignment>
<assignment>
<project_id>7512979208625374996</project_id>
<start_date>2017-12-01T15:30:00+00:00</start_date>
<state_id>0</state_id>
</assignment>
<assignment>
<project_id>5522979707625370402</project_id>
<start_date>2017-12-15T10:00:00+00:00</start_date>
<state_id>1</state_id>
</assignment>
</assignments>
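The question is how to build an efficient index for this type of query. I understand that there is no generic XPath index like the one in MS SQL, so I need to build a specific index. But all the examples I managed to find (e.g. Postgresql 9.x: Index to optimize `xpath_exists` (XMLEXISTS) queries) deal with nested fields, not arrays.
P.S. I tried switching from XML to JSONB, but that would require rewriting a lot of queries to use jsonb_array_elements() with joins, which I would like to avoid.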
Answer 0 (score: 2)
You can take advantage of the fact that xpath() returns an array.
The following expression:
xpath('/assignments/assignment/project_id/text()', assignments)::text[]
returns an array of strings containing all the project IDs. This expression can be indexed:
create index on candidates using gin ((xpath('/assignments/assignment/project_id/text()', assignments)::text[]));
The following query can make use of that index:
select *
from candidates
where xpath('/assignments/assignment/project_id/text()', assignments)::text[] @> array['6512779208625374885'];
@> is the "contains" operator for arrays, which is supported by GIN indexes.
You can use it to check for multiple IDs with a single condition:
select *
from candidates
where xpath('/assignments/assignment/project_id/text()', assignments)::text[] @> array['6512779208625374885', '6512779208625374886'];
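The above returns the rows whose XML contains both project_ids.
If you use the "overlaps" operator &&, you can also search for rows that contain any of the elements: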
select *
from candidates
where xpath('/assignments/assignment/project_id/text()', assignments)::text[] && array['6512779208625374885', '6512779208625374886'];
The above returns the rows whose XML contains at least one of the project_ids.
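To make the difference between the two operators concrete, here is a minimal sketch on literal arrays. It does not touch the candidates table, and the values are made up purely for illustration:

```sql
-- "contains": true only if every element on the right appears on the left
select array['1','2','3'] @> array['1','3'];   -- true
select array['1','2','3'] @> array['1','4'];   -- false

-- "overlaps": true if the two arrays share at least one element
select array['1','2','3'] && array['1','4'];   -- true
select array['1','2','3'] && array['4','5'];   -- false
```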
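For more details on the array operators, see the manual.
The downside is that GIN indexes are larger than B-Tree indexes and more expensive to maintain.
I verified this with the following test setup: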
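One way to gauge that overhead on your own data is to compare the size of the index with the size of the table. This is only a sketch: it assumes the index received the default name candidates_xpath_idx (the name that appears in the query plan further down), so adjust it if yours differs.

```sql
-- on-disk size of the GIN index versus the table (including TOAST data)
select pg_size_pretty(pg_relation_size('candidates_xpath_idx')) as index_size,
       pg_size_pretty(pg_table_size('candidates'))               as table_size;
```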
create table candidates
(
id integer,
assignments xml
);
insert into candidates
select i, format('<assignments>
<assignment>
<project_id>%s</project_id>
<start_date>2018-02-05T14:30:06+00:00</start_date>
<state_id>1</state_id>
</assignment>
<assignment>
<project_id>%s</project_id>
<start_date>2017-12-01T15:30:00+00:00</start_date>
<state_id>0</state_id>
</assignment>
<assignment>
<project_id>%s</project_id>
<start_date>2017-12-15T10:00:00+00:00</start_date>
<state_id>1</state_id>
</assignment></assignments>', i, 10000000 + i, 20000000 + i)::xml
from generate_series(1,1000000) as i;
So the table candidates now contains one million rows, each with 3 different project_ids.
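The plan shown further down refers to an index named candidates_xpath_idx, so the index from earlier in this answer has to exist on the test table before the query is run. A sketch of that step, followed by a quick look at what the indexed expression yields for a single row (the id = 42 filter is an arbitrary choice for illustration):

```sql
-- same index definition as shown earlier, created on the test table
create index on candidates
  using gin ((xpath('/assignments/assignment/project_id/text()', assignments)::text[]));

-- inspect the array the index is built on for one row
select id,
       xpath('/assignments/assignment/project_id/text()', assignments)::text[] as project_ids
from candidates
where id = 42;
```

With the test data above, the row with id 42 should contain the three values 42, 10000042 and 20000042.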
explain (analyze, verbose, buffers)
select *
from candidates
where xpath('/assignments/assignment/project_id/text()', assignments)::text[] @> array['10000047'];
shows the following plan:
                                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on test.candidates  (cost=29.25..6604.48 rows=5000 width=473) (actual time=0.032..0.032 rows=1 loops=1)
   Output: id, assignments
   Recheck Cond: ((xpath('/assignments/assignment/project_id/text()'::text, candidates.assignments, '{}'::text[]))::text[] @> '{10000047}'::text[])
   Heap Blocks: exact=1
   Buffers: shared hit=5
   ->  Bitmap Index Scan on candidates_xpath_idx  (cost=0.00..28.00 rows=5000 width=0) (actual time=0.028..0.028 rows=1 loops=1)
         Index Cond: ((xpath('/assignments/assignment/project_id/text()'::text, candidates.assignments, '{}'::text[]))::text[] @> '{10000047}'::text[])
         Buffers: shared hit=4
 Planning time: 0.162 ms
 Execution time: 0.078 ms
Less than a tenth of a millisecond to search one million XML documents doesn't seem too bad.
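Applied to the query from the question, the same idea might look like the sketch below. It assumes that the real candidates table has the id and fullname columns used in the question and that the expression index from this answer exists on it. Note that it only matches project_id elements under /assignments/assignment, which is slightly narrower than the original //assignments path.

```sql
-- rewrite of the xmlexists() filter so that the GIN expression index can be used
select t."id", t."fullname"
from "candidates" t
where xpath('/assignments/assignment/project_id/text()', t.assignments)::text[]
      @> array['6512779208625374885'];
```

The expression in the where clause has to match the indexed expression exactly (same XPath string, same cast), otherwise the planner cannot use the index.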