Question

I have tables that look like this (with the row count of each table, to give an idea of the ratio):

expectedreportsnodes (1,000,000 rows):

 nodejoinkey   | integer  | not null
 nodeid        | text     | not null
 nodeconfigids | text[]   |

The nodeconfigids array typically contains 1-50 values.

And a second table:

expectedreports (10,000 rows):

 pkid       | integer  | not null
 nodejoinkey| integer  | not null
 ...

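I want to query all expected reports for which an entry with a given nodeConfigId exists in expectedreportsnodes. I may have a large number of nodeConfigIds (thousands).

What is the most efficient way to do that?

For now, I have: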

select E.pkid, E.nodejoinkey from expectedreports E 
inner join (
  select NN.nodejoinkey, NN.nodeid, NN.nodeconfigids from (
    select N.nodejoinkey, N.nodeid, unnest(N.nodeconfigids) as nodeconfigids  
    from expectedreportsnodes N
  ) as NN 
  where NN.nodeconfigids IN ( VALUES ('cf1'), ('cf2'), ..., ('cf1000'), ..., ('cfN')  )
  ) as NNN on E.nodejoinkey = NNN.nodejoinkey;

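This seems to give the expected results, but takes a very long time to execute.

How can I improve the query?

Update

The suggested answer using array overlap and an index is much less efficient on my setup. I can't say why.
The following version seems to be the fastest (again, I don't really understand why - perhaps because I typically have few values in nodeconfigids?):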

select E.pkid, E.nodejoinkey from expectedreports E
inner join (
  select NN.nodejoinkey, NN.nodeconfigids
  from (
    select N.nodejoinkey, N.nodeconfigids, 
           generate_subscripts(N.nodeconfigids,1) as v
    from expectedreportsnodes N
  ) as NN
  where NN.nodeconfigids[v] in(values ('cf1'), ('cf2'), ..., ('cf1000'), ..., ('cfN') )
) as NNN
on E.nodejoinkey = NNN.nodejoinkey

Answer 1

The key to performance is a GIN index on the array column, combined with operators that can use the index.

CREATE INDEX ern_gin_idx ON expectedreportsnodes USING gin (nodeconfigids);

Query:

SELECT e.pkid, nodejoinkey 
FROM   expectedreports e
JOIN   expectedreportsnodes n USING (nodejoinkey)
WHERE  n.nodeconfigids && '{cf1, cf2, ..., cfN}'::text[];

This should work for arrays of text, because the default GIN operator class supports the overlap operator &&. Per documentation:

Name        Indexed Data Type  Indexable Operators
...
_text_ops   text[]             && <@ = @>
...

Also make sure there is a plain btree index on expectedreports.nodejoinkey:

CREATE INDEX expectedreports_nodejoinkey_idx ON expectedreports (nodejoinkey);
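
A quick way to check whether the planner actually picks up the GIN index on a given setup is to look at the execution plan. The sketch below reuses the overlap query from this answer; the three-element array literal is only a placeholder for the real input list, and the resulting plan depends on your data and Postgres version.

```sql
-- Sketch: show the actual plan, timings and buffer usage of the overlap query.
-- '{cf1, cf2, cf3}' is an illustrative placeholder, not the real input list.
EXPLAIN (ANALYZE, BUFFERS)
SELECT e.pkid, nodejoinkey
FROM   expectedreports e
JOIN   expectedreportsnodes n USING (nodejoinkey)
WHERE  n.nodeconfigids && '{cf1, cf2, cf3}'::text[];
```

If the index is used, the plan should contain a Bitmap Index Scan on ern_gin_idx. A Seq Scan on expectedreportsnodes instead would suggest that the planner estimated the index to be more expensive for this input.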

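Optimizing with a multicolumn index

To optimize the given query further, you can include the otherwise useless column nodejoinkey in the index, so that the join condition and the array condition can be checked against the same index. (GIN indexes do not support index-only scans, so the table itself is still visited.)

To include the integer column, first install the additional module btree_gin, which provides the necessary GIN operator classes. Run once per database: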

CREATE EXTENSION btree_gin;

Then:

CREATE INDEX ern_multi_gin_idx ON expectedreportsnodes
USING gin (nodejoinkey, nodeconfigids);

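The query stays the same.

Alternative with `unnest()`

If a GIN index is not an option (or does not perform to your expectations), you can still optimize the query.

It is particularly efficient to unnest long input arrays (or use a VALUES expression as in your example) and then join to the derived table. An IN construct is typically the slowest option.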

SELECT e.pkid, nodejoinkey
FROM  (
   SELECT DISTINCT n.nodejoinkey 
   FROM  (SELECT nodejoinkey, unnest(nodeconfigids) AS nodeconfigid
          FROM   expectedreportsnodes) n
   JOIN  (VALUES ('cf1'), ('cf2'), ..., ('cfN')) t(nodeconfigid) USING (nodeconfigid)
   ) n
JOIN   expectedreports e USING (nodejoinkey);

Modern form in Postgres 9.3+, with an implicit JOIN LATERAL:

SELECT e.pkid, nodejoinkey
FROM  (
   SELECT DISTINCT n.nodejoinkey 
   FROM  expectedreportsnodes n
       , unnest(n.nodeconfigids) nodeconfigid
   JOIN  unnest('{cf1, cf2, ..., cfN}'::text[]) t(nodeconfigid) USING (nodeconfigid)
   ) n
JOIN   expectedreports e USING (nodejoinkey);

Your original queries may produce duplicate rows in the result. I fold the duplicates with DISTINCT.
Details on JOIN LATERAL:
- Dynamically execute query using the output of another query
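
To see where the duplicates come from, here is a small self-contained sketch. The two inline rows are made-up sample data, not taken from the question; the query needs Postgres 9.3 or later because of the implicit lateral reference.

```sql
-- Sketch with made-up sample rows: node 1 carries two of the requested ids.
SELECT n.nodejoinkey, u.nodeconfigid
FROM  (VALUES (1, ARRAY['cf1', 'cf2'])
            , (2, ARRAY['cf3'])) n(nodejoinkey, nodeconfigids)
     , unnest(n.nodeconfigids) u(nodeconfigid)   -- one row per array element
WHERE  u.nodeconfigid IN ('cf1', 'cf2');
-- nodejoinkey 1 is returned twice, once per matching element.
-- Selecting DISTINCT n.nodejoinkey instead collapses these into one row.
```

Without the DISTINCT step, each of those repeated keys would also multiply the matching rows of expectedreports in the outer join.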

For short input arrays, the `ANY` construct is faster:

SELECT e.pkid, nodejoinkey
FROM  (
   SELECT DISTINCT e.nodejoinkey 
   FROM   expectedreportsnodes e
   JOIN   unnest(e.nodeconfigids) u(nodeconfigid) 
          ON u.nodeconfigid = ANY ('{cf1, cf2, ..., cfN}'::text[])
   ) n
JOIN   expectedreports e USING (nodejoinkey);

Answer 2

The following avoids unnesting the array and may be faster:

select E.pkid, E.nodejoinkey 
from expectedreports E 
  join expectedreportsnodes nn on E.nodejoinkey = nn.nodejoinkey
where nn.nodeconfigids && array['cf1', 'cf2', ..., 'cf1000', ..., 'cfN'];

It matches the rows of expectedreportsnodes for which any value of the given array appears in the nodeconfigids column.
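
As a minimal illustration of what the overlap operator tests, the following standalone statement compares literal arrays; the values are arbitrary examples.

```sql
-- && is true as soon as the two arrays share at least one element.
SELECT ARRAY['cf1', 'cf7'] && ARRAY['cf7', 'cf9'] AS shares_cf7   -- true
     , ARRAY['cf1', 'cf7'] && ARRAY['cf2', 'cf9'] AS no_overlap;  -- false
```

The order of the elements and the number of shared values do not matter, so a row matches at most once and no DISTINCT is needed on the expectedreportsnodes side.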
