Question

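Using Postgres, I have a schema with conversations and conversationUsers. Each conversation has many conversationUsers. I want to find the conversation that has exactly the specified set of conversationUsers. In other words, given an array of userIds (e.g. [1, 4, 6]), I want to find the conversation that contains only those users and no others.

So far I have tried this: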

SELECT c."conversationId"
FROM "conversationUsers" c
WHERE c."userId" IN (1, 4)
GROUP BY c."conversationId"
HAVING COUNT(c."userId") = 2;

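Unfortunately, this also seems to return conversations that include these two users among others. (For example, it returns a result if the conversation also includes "userId" 5.)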

Answer 1

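This is a case of relational-division, with the added special requirement that the same conversation must not have any additional users.

Assuming the multicolumn PK of table "conversationUsers", which enforces uniqueness of combinations and NOT NULL, and also implicitly provides the index essential for performance. The column order of the multicolumn PK matters! With a different order you have to do more.
About the order of index columns: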

Is a composite index also good for queries on the first field?
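
For illustration, a minimal sketch of the kind of table definition the answer relies on. The column types and the PK column order shown here are assumptions, not taken from the question; adjust them to the actual schema.

```sql
-- Sketch only: column types and the PK column order are assumptions.
CREATE TABLE "conversationUsers" (
   "conversationId" int NOT NULL
 , "userId"         int NOT NULL
 , PRIMARY KEY ("userId", "conversationId")  -- assumed order: "userId" first
);
```

With "userId" as the leading column, the PK index supports lookups by user. Lookups by "conversationId" alone (as in the NOT EXISTS subqueries) would likely benefit from a second index with "conversationId" leading.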

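For the basic query, there is the "brute force" approach: count the matching users for all conversations of all given users, then keep only the conversations that match all given users. This is OK for small tables and/or short input arrays and/or few conversations per user, but it does not scale well: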

SELECT "conversationId"
FROM   "conversationUsers" c
WHERE  "userId" = ANY ('{1,4,6}'::int[])
GROUP  BY 1
HAVING count(*) = array_length('{1,4,6}'::int[], 1)
AND    NOT EXISTS (
   SELECT FROM "conversationUsers"
   WHERE  "conversationId" = c."conversationId"
   AND    "userId" <> ALL('{1,4,6}'::int[])
   );

Conversations with additional users are eliminated with a NOT EXISTS anti-semi-join. More:

How do I (or can I) SELECT DISTINCT on multiple columns?

Alternative techniques:

Select rows which are not present in other table

There are various other (faster) relational-division query techniques. But the fastest ones are not well suited for a dynamic number of user IDs.

How to filter SQL results in a has-many-through relation

For a fast query that can also deal with a dynamic number of user IDs, consider a recursive CTE:

WITH RECURSIVE rcte AS (
   SELECT "conversationId", 1 AS idx
   FROM   "conversationUsers"
   WHERE  "userId" = ('{1,4,6}'::int[])[1]

   UNION ALL
   SELECT c."conversationId", r.idx + 1
   FROM   rcte                r
   JOIN   "conversationUsers" c USING ("conversationId")
   WHERE  c."userId" = ('{1,4,6}'::int[])[idx + 1]
   )
SELECT "conversationId"
FROM   rcte r
WHERE  idx = array_length(('{1,4,6}'::int[]), 1)
AND    NOT EXISTS (
   SELECT FROM "conversationUsers"
   WHERE  "conversationId" = r."conversationId"
   AND    "userId" <> ALL('{1,4,6}'::int[])
   );

For ease of use, wrap this in a function or prepared statement. Like:

PREPARE conversations(int[]) AS
WITH RECURSIVE rcte AS (
   SELECT "conversationId", 1 AS idx
   FROM   "conversationUsers"
   WHERE  "userId" = $1[1]

   UNION ALL
   SELECT c."conversationId", r.idx + 1
   FROM   rcte                r
   JOIN   "conversationUsers" c USING ("conversationId")
   WHERE  c."userId" = $1[idx + 1]
   )
SELECT "conversationId"
FROM   rcte r
WHERE  idx = array_length($1, 1)
AND    NOT EXISTS (
   SELECT FROM "conversationUsers"
   WHERE  "conversationId" = r."conversationId"
   AND    "userId" <> ALL($1)
   );

Call:

EXECUTE conversations('{1,4,6}');

db<>fiddle here (also demonstrating a function)
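
A prepared statement only lasts for the current session, so a function is the persistent alternative. A minimal sketch of such a wrapper follows; the function name f_conversations is hypothetical, and it assumes both "conversationId" and "userId" are of type integer.

```sql
-- Hypothetical wrapper; assumes integer "conversationId" and "userId".
CREATE OR REPLACE FUNCTION f_conversations(_user_ids int[])
  RETURNS SETOF int
  LANGUAGE sql STABLE AS
$func$
WITH RECURSIVE rcte AS (
   -- start with all conversations of the first user in the array
   SELECT "conversationId", 1 AS idx
   FROM   "conversationUsers"
   WHERE  "userId" = _user_ids[1]

   UNION ALL
   -- keep only conversations that also contain the next user
   SELECT c."conversationId", r.idx + 1
   FROM   rcte                r
   JOIN   "conversationUsers" c USING ("conversationId")
   WHERE  c."userId" = _user_ids[r.idx + 1]
   )
SELECT "conversationId"
FROM   rcte r
WHERE  idx = array_length(_user_ids, 1)   -- matched every given user
AND    NOT EXISTS (                       -- and no additional user
   SELECT FROM "conversationUsers"
   WHERE  "conversationId" = r."conversationId"
   AND    "userId" <> ALL (_user_ids)
   );
$func$;

-- Call:
SELECT * FROM f_conversations('{1,4,6}');
```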

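There is still room for improvement: to get good performance, put the users with the fewest conversations first in the input array, so that as many rows as possible are eliminated early. To get top performance, you can dynamically generate a non-dynamic, non-recursive query (using one of the fast techniques from the first link) and then execute it. You could even wrap that in a single plpgsql function with dynamic SQL ...

More explanation: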

Using same column multiple times in WHERE clause

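Alternative: MV for a rarely written table

If the table "conversationUsers" is mostly read-only (old conversations are unlikely to change), you might use a MATERIALIZED VIEW with pre-aggregated users in sorted arrays and create a plain btree index on that array column.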

CREATE MATERIALIZED VIEW mv_conversation_users AS
SELECT "conversationId", array_agg("userId") AS users  -- sorted array
FROM (
   SELECT "conversationId", "userId"
   FROM   "conversationUsers"
   ORDER  BY 1, 2
   ) sub
GROUP  BY 1
ORDER  BY 1;

CREATE INDEX ON mv_conversation_users (users) INCLUDE ("conversationId");

The covering index demonstrated here requires Postgres 11.

https://dba.stackexchange.com/a/207938/3684

About sorting rows in a subquery:

How to apply ORDER BY and LIMIT in combination with an aggregate function?

In older versions, use a plain multicolumn index on (users, "conversationId"). With very long arrays, a hash index might make sense in Postgres 10 or later.

Then the faster query would be:

SELECT "conversationId"
FROM   mv_conversation_users c
WHERE  users = '{1,4,6}'::int[];  -- sorted array!
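
Since the MV stores each user array in ascending order, the input array has to be in the same order for the equality test to match. If the caller cannot guarantee that, one possible sketch is to sort (and de-duplicate) the input on the fly; the unsorted literal '{6,1,4}' is just an example value.

```sql
-- Sort and de-duplicate the input so it matches the MV's sorted arrays.
SELECT "conversationId"
FROM   mv_conversation_users
WHERE  users = ARRAY(
          SELECT DISTINCT u
          FROM   unnest('{6,1,4}'::int[]) u
          ORDER  BY u
       );
```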

db<>fiddle here

You have to weigh the added costs of storage, writes and maintenance against the benefits to read performance.
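
Part of that maintenance cost is that a materialized view does not update itself; it has to be refreshed after changes to the underlying table. A sketch of the two standard options, assuming the MV defined above:

```sql
-- Plain refresh: rebuilds the MV and blocks reads on it while running.
REFRESH MATERIALIZED VIEW mv_conversation_users;

-- Concurrent refresh allows reads meanwhile, but requires a unique index
-- on the MV. "conversationId" is unique here because of the GROUP BY.
CREATE UNIQUE INDEX ON mv_conversation_users ("conversationId");
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_conversation_users;
```

How often to refresh depends on how stale the results are allowed to be.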

Aside: consider legal identifiers that do not need double quotes, such as conversation_id instead of "conversationId" etc.:

Are PostgreSQL column names case-sensitive?

Answer 2

You can modify your query like this and it should work:

SELECT c."conversationId"
FROM "conversationUsers" c
WHERE c."conversationId" IN (
    SELECT DISTINCT c1."conversationId"
    FROM "conversationUsers" c1
    WHERE c1."userId" IN (1, 4)
    )
GROUP BY c."conversationId"
HAVING COUNT(DISTINCT c."userId") = 2;

Answer 3

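This may be easier to follow. You want the conversation ID, so group by it. Add a HAVING clause that requires the number of matching user IDs to equal the count of all distinct user IDs within the group. This will work, but it takes longer to process because there is no pre-qualifier.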

select
      cu.ConversationId
   from
      conversationUsers cu
   group by
      cu.ConversationID
   having 
      sum( case when cu.userId IN (1, 4) then 1 else 0 end ) = count( distinct cu.UserID )

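To narrow the list down further, pre-query only the conversations that at least one of the people is in... if they are not in a conversation to begin with, why bother considering it.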

select
      cu.ConversationId
   from
      ( select cu2.ConversationID
           from conversationUsers cu2
           where cu2.userID = 4 ) preQual
      JOIN conversationUsers cu
         ON preQual.ConversationId = cu.ConversationId
   group by
      cu.ConversationID
   having 
      sum( case when cu.userId IN (1, 4) then 1 else 0 end ) = count( distinct cu.UserID )

SQL query to find a row with a specific number of associations
