Question

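I have two tables in PostgreSQL with a many-to-many association. The first table contains activities, each of which may have zero or more reasons: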

CREATE TABLE activity (
   id integer NOT NULL,
   -- other fields removed for readability
);

CREATE TABLE reason (
   id varchar(1) NOT NULL,
   -- other fields here
);

To perform the association, a join table exists between the two tables:

CREATE TABLE activity_reason (
   activity_id integer NOT NULL, -- refers to activity.id
   reason_id varchar(1) NOT NULL, -- refers to reason.id
   CONSTRAINT activity_reason_activity FOREIGN KEY (activity_id) REFERENCES activity (id),
   CONSTRAINT activity_reason_reason FOREIGN KEY (reason_id) REFERENCES reason (id)
);

I would like to count the possible associations between activities and reasons. Suppose I have these records in the table activity_reason:

+--------------+------------+
| activity_id  | reason_id  |
+--------------+------------+
|            1 | A          |
|            1 | B          |
|            2 | A          |
|            2 | B          |
|            3 | A          |
|            4 | C          |
|            4 | D          |
|            4 | E          |
+--------------+------------+

I should get something like:

+-------+---+------+-------+
| count |   |      |       |
+-------+---+------+-------+
|     2 | A | B    | NULL  |
|     1 | A | NULL | NULL  |
|     1 | C | D    | E     |
+-------+---+------+-------+

Or, eventually, something like:

+-------+-------+
| count |       |
+-------+-------+
|     2 | A,B   |
|     1 | A     |
|     1 | C,D,E |
+-------+-------+

I can't find the SQL query to do this.

Answer 1

I think you can get what you want with this query:

SELECT count(*) as count, reasons
FROM (
  SELECT activity_id, array_agg(reason_id) AS reasons
  FROM (
    -- activity's key column is "id" in the question's schema
    SELECT A.id AS activity_id, AR.reason_id
    FROM activity A
    LEFT JOIN activity_reason AR ON AR.activity_id = A.id
    ORDER BY A.id, AR.reason_id
  ) AS ordered_reasons
  GROUP BY activity_id
) reason_arrays
GROUP BY reasons

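First, you aggregate all of an activity's reasons into one array per activity. You have to order the associations first; otherwise ['a','b'] and ['b','a'] would be considered different sets and would get separate counts. You also need the join, or any activity without reasons will not show up in the result set. I'm not sure whether that is desirable; if you would rather exclude activities that have no reasons, I can take it out. Then you count the number of activities that share the same set of reasons.

Here is a sqlfiddle to demonstrate.

As Gordon Linoff mentioned, you could also use a string instead of an array. I'm not sure which would perform better.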
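A quick way to see why the ordering step matters: PostgreSQL compares arrays element by element, so the same reasons in a different order are not equal. A minimal sketch with literal arrays (the column aliases are only illustrative):

```sql
-- Array equality is order-sensitive, so unsorted aggregates
-- of the same reasons could land in different groups.
SELECT ARRAY['A','B'] = ARRAY['B','A'] AS same_when_unsorted,  -- false
       ARRAY['A','B'] = ARRAY['A','B'] AS same_when_sorted;    -- true
```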

Answer 2

We need to compare sorted lists of reasons to identify identical sets.

SELECT count(*) AS ct, reason_list
FROM  (
   SELECT array_agg(reason_id) AS reason_list
   FROM  (SELECT * FROM activity_reason ORDER BY activity_id, reason_id) ar1
   GROUP  BY activity_id
   ) ar2
GROUP  BY reason_list
ORDER  BY ct DESC, reason_list;

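Just ORDER BY reason_id in the innermost subquery would work, too, but adding activity_id is typically faster.

Strictly speaking, we don't need the innermost subquery at all. This works as well: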

SELECT count(*) AS ct, reason_list
FROM  (
   SELECT array_agg(reason_id ORDER BY reason_id) AS reason_list
   FROM   activity_reason
   GROUP  BY activity_id
   ) ar2
GROUP  BY reason_list
ORDER  BY ct DESC, reason_list;

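But it is typically slower when processing all or most of the table. Quoting the manual:

Alternatively, supplying the input values from a sorted subquery will usually work.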
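To try the aggregate-level ORDER BY variant without creating any tables, the sample rows from the question can be supplied inline. This is only a sketch: the CTE below stands in for the real activity_reason table, and the subquery alias is arbitrary.

```sql
-- Inline stand-in for activity_reason, using the rows from the question.
WITH activity_reason (activity_id, reason_id) AS (
   VALUES (1, 'A'), (1, 'B'), (2, 'A'), (2, 'B'),
          (3, 'A'), (4, 'C'), (4, 'D'), (4, 'E')
)
SELECT count(*) AS ct, reason_list
FROM  (
   -- One sorted array of reasons per activity
   SELECT array_agg(reason_id ORDER BY reason_id) AS reason_list
   FROM   activity_reason
   GROUP  BY activity_id
   ) ar
GROUP  BY reason_list            -- identical sorted arrays fall into one group
ORDER  BY ct DESC, reason_list;
```

With these rows the result should be {A,B} with a count of 2, followed by {A} and {C,D,E} with a count of 1 each. If the comma-separated form from the question is preferred, selecting array_to_string(reason_list, ',') in the outer query should produce it.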

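We could use string_agg() instead of array_agg(), which would work for your example with varchar(1) (which might be more efficient with the data type "char", by the way). It can fail for longer strings, though, because the aggregated value can be ambiguous.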
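The ambiguity is easy to reproduce once a reason value can contain the delimiter. In this sketch the data is made up purely for illustration: activity 1 has the single reason 'A,B', while activity 2 has the two reasons 'A' and 'B'.

```sql
-- Hypothetical rows: one reason that contains a comma vs. two separate reasons.
SELECT activity_id,
       string_agg(reason_id, ',' ORDER BY reason_id) AS reason_list
FROM  (VALUES (1, 'A,B'), (2, 'A'), (2, 'B')) AS t (activity_id, reason_id)
GROUP  BY activity_id
ORDER  BY activity_id;
```

Both activities come out as the string A,B, so grouping by reason_list would count them as the same set. With array_agg() the two remain distinct, because the first is a one-element array and the second a two-element array.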

If reason_id were an integer (as it typically is), there is another, faster solution using sort() from the additional module intarray:

sort()
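A sketch of how sort() might be applied here, under two assumptions that are not part of the question: the intarray extension is installed, and reason_id is an integer column in a hypothetical table named activity_reason_int.

```sql
-- Assumption: the intarray module is available in this database.
CREATE EXTENSION IF NOT EXISTS intarray;

-- Assumption: activity_reason_int(activity_id integer, reason_id integer)
-- is a hypothetical integer-keyed variant of the join table.
SELECT count(*) AS ct, reason_list
FROM  (
   -- Aggregate first, then sort the resulting integer array in one call
   SELECT sort(array_agg(reason_id)) AS reason_list
   FROM   activity_reason_int
   GROUP  BY activity_id
   ) ar
GROUP  BY reason_list
ORDER  BY ct DESC, reason_list;
```

Here the array is sorted after aggregation, so neither a sorted subquery nor an ORDER BY inside the aggregate is needed.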

Query to count the frequencies of many-to-many associations
