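Query to count the frequency of many-to-many associations

Time: 2016-04-06 00:28:19

Tags: sql arrays postgresql many-to-many aggregate

I have two tables with a many-to-many association in PostgreSQL. The first table contains activities, each of which may have zero or more reasons: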

CREATE TABLE activity (
   id integer NOT NULL,
   -- other fields removed for readability
);

CREATE TABLE reason (
   id varchar(1) NOT NULL,
   -- other fields here
);

To implement the association, a join table exists between those two tables:

CREATE TABLE activity_reason (
   activity_id integer NOT NULL, -- refers to activity.id
   reason_id varchar(1) NOT NULL, -- refers to reason.id
   CONSTRAINT activity_reason_activity FOREIGN KEY (activity_id) REFERENCES activity (id),
   CONSTRAINT activity_reason_reason FOREIGN KEY (reason_id) REFERENCES reason (id)
);

I would like to count how often each combination of reasons occurs across activities. Suppose I have these records in the table activity_reason:

+--------------+------------+
| activity_id  |  reason_id |
+--------------+------------+
|           1  |          A |
|           1  |          B |
|           2  |          A |
|           2  |          B |
|           3  |          A |
|           4  |          C |
|           4  |          D |
|           4  |          E |
+--------------+------------+
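To try the queries discussed here against the sample rows, something like the following could be used. It is only a sketch: it assumes a scratch session, uses a temporary table (which shadows any permanent table of the same name until the session ends), and leaves out the foreign keys and other columns.

```sql
-- Hypothetical scratch setup: a temporary copy of the join table
-- holding only the sample rows (constraints omitted).
CREATE TEMP TABLE activity_reason (
   activity_id integer    NOT NULL,
   reason_id   varchar(1) NOT NULL
);

INSERT INTO activity_reason (activity_id, reason_id) VALUES
   (1, 'A'), (1, 'B'),
   (2, 'A'), (2, 'B'),
   (3, 'A'),
   (4, 'C'), (4, 'D'), (4, 'E');
```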

I should get something like:

+-------+---+------+-------+
| count |   |      |       |
+-------+---+------+-------+
|     2 | A | B    | NULL  |
|     1 | A | NULL | NULL  |
|     1 | C | D    | E     |
+-------+---+------+-------+

Or, alternatively, something like:

+-------+-------+
| count |       |
+-------+-------+
|     2 | A,B   |
|     1 | A     |
|     1 | C,D,E |
+-------+-------+

I can't find an SQL query that does this.

3 answers:

Answer 0 (score: 2)

I think you can get what you want using this query:

SELECT count(*) as count, reasons
FROM (
  SELECT activity_id, array_agg(reason_id) AS reasons
  FROM (
    -- activity's key column is id (see the table definition in the question)
    SELECT A.id AS activity_id, AR.reason_id
    FROM activity A
    LEFT JOIN activity_reason AR ON AR.activity_id = A.id
    ORDER BY A.id, AR.reason_id
  ) AS ordered_reasons
  GROUP BY activity_id
) reason_arrays
GROUP BY reasons

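First, you aggregate all of the reasons for each activity into an array, one array per activity. You have to order the associations first; otherwise ['a','b'] and ['b','a'] would be treated as different sets and would get separate counts. You also need the join, or any activity without reasons will not show up in the result set. I'm not sure whether that is desirable; I can take the join back out if you want activities without reasons to be excluded. Then you count the number of activities that share the same set of reasons.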
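The reason the ordering matters is that PostgreSQL compares arrays element by element, so position counts. A minimal sketch with literal arrays (the column aliases are made up for illustration):

```sql
-- Arrays are compared element by element, so the same values
-- in a different order do not compare as equal.
SELECT ARRAY['A','B'] = ARRAY['B','A'] AS equal_when_unsorted,  -- false
       ARRAY['A','B'] = ARRAY['A','B'] AS equal_when_sorted;    -- true
```

Because GROUP BY uses the same equality, unsorted arrays would split one logical set of reasons into several groups.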

Here is an sqlfiddle to demonstrate.

As Gordon Linoff mentioned, you could also use a string instead of an array. I'm not sure which performs better.

Answer 1 (score: 1)

We need to compare sorted lists of reasons to identify identical sets.

SELECT count(*) AS ct, reason_list
FROM  (
   SELECT array_agg(reason_id) AS reason_list
   FROM  (SELECT * FROM activity_reason ORDER BY activity_id, reason_id) ar1
   GROUP  BY activity_id
   ) ar2
GROUP  BY reason_list
ORDER  BY ct DESC, reason_list;

ORDER BY reason_id alone in the innermost subquery would work too, but adding activity_id is typically faster.

Strictly speaking, we don't need the innermost subquery at all. This works as well:

SELECT count(*) AS ct, reason_list
FROM  (
   SELECT array_agg(reason_id ORDER BY reason_id) AS reason_list
   FROM   activity_reason
   GROUP  BY activity_id
   ) ar2
GROUP  BY reason_list
ORDER  BY ct DESC, reason_list;

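But it is typically slower when processing all or most of the table. Quoting the manual:

  Alternatively, supplying the input values from a sorted subquery will usually work.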

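We can use string_agg() instead of array_agg(), which works for your example with varchar(1) (which might be more efficient with the data type "char", by the way). For longer strings, however, it can fail because the aggregated value can be ambiguous.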
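One way such an ambiguity can arise is when a value contains the separator itself. A small sketch, using hypothetical multi-character values that do not occur in the question's data:

```sql
-- Two different sets of values collapse into the same string
-- once they are joined with ',' (hypothetical values for illustration).
SELECT string_agg(v, ',' ORDER BY v) AS reasons
FROM  (VALUES ('a,b'), ('c')) t(v);   -- 'a,b,c'

SELECT string_agg(v, ',' ORDER BY v) AS reasons
FROM  (VALUES ('a'), ('b,c')) t(v);   -- 'a,b,c'
```

An array keeps the element boundaries, so array_agg() does not have this problem.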

If reason_id is an integer (as it typically is), there is another, faster solution using sort() from the additional module intarray.
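A sketch of how sort() might be applied here, under stated assumptions: the table name activity_reason_int is hypothetical and stands for a variant of the join table whose reason_id column is an integer, and the intarray extension is available on the server.

```sql
-- Assumption: the intarray extension can be installed in this database.
CREATE EXTENSION IF NOT EXISTS intarray;

-- activity_reason_int is a hypothetical variant of the join table
-- with reason_id integer instead of varchar(1).
SELECT count(*) AS ct, reason_list
FROM  (
   SELECT sort(array_agg(reason_id)) AS reason_list  -- sort the array after aggregating
   FROM   activity_reason_int
   GROUP  BY activity_id
   ) ar2
GROUP  BY reason_list
ORDER  BY ct DESC, reason_list;
```

Here the array is sorted after aggregation, so no sorted subquery or ORDER BY inside the aggregate is needed.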


Answer 2 (score: 0)

You can do this using string_agg():

select reasons, count(*)
from (select activity_id, string_agg(reason_id, ',' order by reason_id) as reasons
      from activity_reason
      group by activity_id
     ) a
group by reasons
order by count(*) desc;