Count all existing combinations of record groupings

Time: 2019-04-05 11:50:21

Tags: sql postgresql

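I have these database tables:

  • questions: id, text
  • answers: id, text, question_id
  • answer_tags: id, answer_id, tag_id
  • tags: id, text

  • A question has many answers
  • An answer has many tags through answer_tags and belongs to a question
  • A tag has many answers through answer_tags
  • An answer can have an unlimited number of tags

I would like to show every combination of tags that occurs together, ordered by count.

Example data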

Question 1, Answer 1, tag1, tag2, tag3, tag4
Question 2, Answer 2, tag2, tag3, tag4
Question 3, Answer 3, tag3, tag4
Question 4, Answer 4, tag4
Question 5, Answer 5, tag3, tag4, tag5
Question 1, Answer 6, <no tags>

How can I solve this using SQL?

I'm not sure whether this is possible in SQL, but I think it would need a RECURSIVE approach.

Expected results:

tag3, tag4 occur 4 times
tag2, tag3, tag4 occur 2 times
tag2, tag3 occur 2 times

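Only groupings of more than one tag are returned. A single tag is never returned; a combination must contain at least 2 tags to be counted.

3 answers:

Answer 0: (score: 4)

Building on @filiprem's answer, and using a slightly modified version of the function from the answer here, you get: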

--test data
create table questions (id int, text varchar(100));
create table answers (id int, text varchar(100), question_id int);
create table answer_tags (id int, answer_id int, tag_id int);
create table tags (id int, text varchar(100));

insert into questions values (1, 'question1'), (2, 'question2'), (3, 'question3'), (4, 'question4'), (5, 'question5');
insert into answers values (1, 'answer1', 1), (2, 'answer2', 2), (3, 'answer3', 3), (4, 'answer4', 4), (5, 'answer5', 5), (6, 'answer6', 1);
insert into tags values (1, 'tag1'), (2, 'tag2'), (3, 'tag3'), (4, 'tag4'), (5, 'tag5');
insert into answer_tags values 
(1,1,1), (2,1,2), (3,1,3), (4,1,4),
(5,2,2), (6,2,3), (7,2,4),
(8,3,3), (9,3,4),
(10,4,4),
(11,5,3), (12,5,4), (13,5,5);
--end test data

--function to get all combinations of 2 or 3 elements from an array (the "<= 2" check in the recursion caps the size at 3)
create or replace function get_combinations(source anyarray) returns setof anyarray as $$
 with recursive combinations(combination, indices) as (
   select source[i:i], array[i] from generate_subscripts(source, 1) i
   union all
   select c.combination || source[j], c.indices || j
   from   combinations c, generate_subscripts(source, 1) j
   where  j > all(c.indices) and
          array_length(c.combination, 1) <= 2
 )
 select combination from combinations
 where  array_length(combination, 1) >= 2
$$ language sql;

--expected results
SELECT tags, count(*) FROM (
    SELECT q.id, get_combinations(array_agg(DISTINCT t.text)) AS tags
    FROM questions q
    JOIN answers a ON a.question_id = q.id
    JOIN answer_tags at ON at.answer_id = a.id
    JOIN tags t ON t.id = at.tag_id
    GROUP BY q.id
) t1
GROUP BY tags
HAVING count(*)>1;

Note: this also returns tag2, tag4 with a count of 2 (from questions 1 and 2), which is not listed in the expected results.
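
One more thing to be aware of: the array_length(c.combination, 1) <= 2 condition in get_combinations stops the recursion at three elements, so groupings of four or more tags are never counted. That is enough for the sample data. If larger groupings matter, a variant without the cap could look like the sketch below. The name get_all_combinations is made up for this example, and the number of rows grows exponentially with the number of tags, so treat it as an illustration rather than something to run on large tag sets.

```sql
-- Hypothetical variant of get_combinations: same recursion, no size cap.
create or replace function get_all_combinations(source anyarray) returns setof anyarray as $$
 with recursive combinations(combination, indices) as (
   -- start with every single element, remembering its index
   select source[i:i], array[i] from generate_subscripts(source, 1) i
   union all
   -- append only elements at a higher index, so each combination appears once
   select c.combination || source[j], c.indices || j
   from   combinations c, generate_subscripts(source, 1) j
   where  j > all(c.indices)
 )
 select combination from combinations
 where  array_length(combination, 1) >= 2
$$ language sql;

-- usage
select get_all_combinations(array['tag1', 'tag2', 'tag3', 'tag4']) as tags;
```

For the four-element array this should return 11 rows: 6 pairs, 4 triples and the full set.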

Answer 1: (score: 2)

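You can indeed use a recursive CTE to produce the possible combinations. First, select every tag ID as a one-element array. Then UNION ALL a join of the CTE with the tags, appending a tag ID to the array whenever it is greater than the largest ID already in the array.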
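
To see what the recursive part produces on its own, here is a small sketch that runs against three inline tag IDs instead of the real tags table (tag_ids and ids are names made up for this example). Because each step only appends an ID larger than the last element, every combination is generated exactly once, with its IDs in ascending order.

```sql
WITH RECURSIVE tag_ids(id) AS (
    -- stand-in for the tags table, for illustration only
    VALUES (2), (3), (4)
),
cte(ids) AS (
    -- anchor: every tag ID as a one-element array
    SELECT ARRAY[id] FROM tag_ids
    UNION ALL
    -- step: append any ID greater than the last (and therefore largest) element
    SELECT c.ids || t.id
    FROM cte c
    JOIN tag_ids t ON t.id > c.ids[array_upper(c.ids, 1)]
)
SELECT ids
FROM cte
ORDER BY array_length(ids, 1), ids;
```

This should return the three single-element arrays first, then {2,3}, {2,4}, {3,4} and finally {2,3,4}.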

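Join the CTE with an aggregation that collects each answer's tag IDs into an array. In the ON clause, check that the answer's array contains the array from the CTE, using the array containment operator @>.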
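
As a quick illustration of the containment check (literal arrays, not the real data): @> is true when every element of the right-hand array also appears in the left-hand one, regardless of order.

```sql
SELECT ARRAY[2,3,4] @> ARRAY[3,4] AS has_3_and_4,  -- true
       ARRAY[2,3,4] @> ARRAY[4,2] AS has_4_and_2,  -- true, element order does not matter
       ARRAY[3,4]   @> ARRAY[2,3] AS has_2_and_3;  -- false, 2 is missing
```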

In the WHERE clause, exclude the combinations from the CTE that contain only one tag, since you are not interested in those.

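Now GROUP BY the tag combination and, in the HAVING clause, exclude every combination that occurs fewer than two times, since you are not interested in those either. If you like, you can also "translate" the IDs into tag names in the SELECT list.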
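
The ID-to-name "translation" can be tried in isolation. The sketch below uses an inline stand-in for the tags table and a literal array in place of a combination from the CTE. The ORDER BY inside array_agg is an addition of this sketch, so that the names come back in the same order as the IDs.

```sql
WITH tags(id, text) AS (
    -- stand-in for the real tags table, for illustration only
    VALUES (2, 'tag2'), (3, 'tag3'), (4, 'tag4')
)
SELECT array_agg(t.text ORDER BY un.e) AS tag_names
FROM unnest(ARRAY[2,3,4]) AS un(e)
JOIN tags t ON t.id = un.e;
```

This should return {tag2,tag3,tag4}.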

WITH RECURSIVE "cte"
AS
(
SELECT ARRAY["t"."id"] "id"
       FROM "tags" "t"
UNION ALL
SELECT "c"."id" || "t"."id" "id"
       FROM "cte" "c"
            INNER JOIN "tags" "t"
                       ON "t"."id" > (SELECT max("un"."e")
                                             FROM unnest("c"."id") "un" ("e"))
)
SELECT "c"."id" "id",
       (SELECT array_agg("t"."text")
               FROM unnest("c"."id") "un" ("e")
                    INNER JOIN "tags" "t"
                               ON "t"."id" = "un"."e") "text",
       count(*) "count"
       FROM "cte" "c"
            INNER JOIN (SELECT array_agg("at"."tag_id" ORDER BY "at"."tag_id") "id"
                               FROM "answer_tags" "at"
                               GROUP BY "at"."answer_id") "x"
                       ON "x"."id" @> "c"."id"
       WHERE array_length("c"."id", 1) > 1
       GROUP BY "c"."id"
       HAVING count(*) > 1;

Result:

 id      | text             | count
---------+------------------+-------
 {2,3}   | {tag2,tag3}      |     2
 {3,4}   | {tag3,tag4}      |     4
 {2,4}   | {tag2,tag4}      |     2
 {2,3,4} | {tag2,tag3,tag4} |     2

db<>fiddle

Answer 2: (score: 1)

Try this:

SELECT tags, count(*) FROM (
    SELECT q.id, array_agg(DISTINCT t.text) AS tags
    FROM questions q
    JOIN answers a ON a.question_id = q.id
    JOIN answer_tags at ON at.answer_id = a.id
    JOIN tags t ON t.id = at.tag_id
    GROUP BY q.id
) t1
GROUP BY tags
HAVING count(*)>1;