Question

I have to find pairs of students who take exactly the same courses, from a table with studentID and courseID columns.

studentID | courseID
    1           1
    1           2
    1           3
    2           1
    3           1
    3           2
    3           3

The query should return (1, 3). The result should also not contain duplicate rows such as (1,3) and (3,1).

Answer 1

Given this sample data:

CREATE TABLE student_course (
   student_id integer,
   course_id integer,
   PRIMARY KEY (student_id, course_id)
);

INSERT INTO student_course (student_id, course_id)
VALUES (1, 1), (1, 2), (1, 3), (2, 1), (3, 1), (3, 2), (3, 3) ;

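Using array aggregation

One option is to use a CTE that builds the ordered list of courses each student is taking, then join it to itself on that list: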

WITH student_coursearray(student_id, courses) AS (
  SELECT student_id, array_agg(course_id ORDER BY course_id)
  FROM student_course
  GROUP BY student_id
)
SELECT a.student_id, b.student_id
FROM student_coursearray a INNER JOIN student_coursearray b ON (a.courses = b.courses)
WHERE a.student_id > b.student_id;

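array_agg is actually part of the SQL standard, as is the WITH common table expression syntax. MySQL has no array_agg, and MySQL versions before 8.0 lack WITH as well, so if you want to support MySQL you will have to express this differently.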
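To make the logic of the array approach concrete outside any particular database, here is a minimal Python sketch. It is an illustration only, not part of the original answer: the function name `pairs_with_same_courses` is made up, and the rows are the sample data from the question.

```python
from collections import defaultdict
from itertools import combinations

# Sample rows from the question: (student_id, course_id)
rows = [(1, 1), (1, 2), (1, 3), (2, 1), (3, 1), (3, 2), (3, 3)]

def pairs_with_same_courses(rows):
    """Return (smaller_id, larger_id) pairs of students with identical course lists."""
    # Step 1: the counterpart of array_agg(course_id ORDER BY course_id) per student.
    courses_by_student = defaultdict(list)
    for student_id, course_id in rows:
        courses_by_student[student_id].append(course_id)
    signature = {s: tuple(sorted(c)) for s, c in courses_by_student.items()}

    # Step 2: the counterpart of the self-join on equal arrays.
    # combinations() yields each unordered pair once, so (3, 1) never appears.
    return [
        (a, b)
        for a, b in combinations(sorted(signature), 2)
        if signature[a] == signature[b]
    ]

print(pairs_with_same_courses(rows))
```

Sorting each course list before comparing is what makes the equality test order-independent, which is the role ORDER BY plays inside array_agg.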

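Finding pairings where neither student is missing a course

Another way to think about this is "for each pairing of students, find out whether one is taking a class the other is not". This lends itself to a FULL OUTER JOIN, but it is rather awkward to express. You have to determine the pairings of student IDs of interest, then for each pairing do a full outer join across the sets of classes each student takes. If there are any rows with a NULL, then one student took a class the other did not, so you can use a NOT EXISTS filter to exclude such pairings. That gives you this monster: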

WITH student_id_pairs(left_student, right_student) AS (
  SELECT DISTINCT a.student_id, b.student_id
  FROM student_course a 
  INNER JOIN student_course b ON (a.student_id > b.student_id)
)
SELECT left_student, right_student 
FROM student_id_pairs 
WHERE NOT EXISTS (
  SELECT 1
  FROM (SELECT course_id FROM student_course WHERE student_id = left_student) a
  FULL OUTER JOIN (SELECT course_id FROM student_course b WHERE student_id = right_student) b
    ON (a.course_id = b.course_id)
  WHERE a.course_id IS NULL or b.course_id IS NULL
);

The CTE is optional and may be replaced with CREATE TEMPORARY TABLE AS SELECT ... or whatever else works if your database does not support CTEs.
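The NULL test in the full outer join has a simple set reading: the NULL rows are exactly the courses that only one of the two students takes, that is, the symmetric difference of their course sets. The following Python sketch shows that reading; it is an illustration under that interpretation, with hypothetical function names, not a translation of how the database executes the query.

```python
from itertools import combinations

# Sample rows from the question: (student_id, course_id)
rows = [(1, 1), (1, 2), (1, 3), (2, 1), (3, 1), (3, 2), (3, 3)]

def course_sets(rows):
    """Map each student_id to the set of course_ids that student takes."""
    sets = {}
    for student_id, course_id in rows:
        sets.setdefault(student_id, set()).add(course_id)
    return sets

def pairs_without_mismatch(rows):
    """Keep only the pairs for which no course is taken by just one of the two."""
    sets = course_sets(rows)
    result = []
    for left, right in combinations(sorted(sets), 2):
        # Courses on one side only: the rows that would carry a NULL
        # on the other side of the FULL OUTER JOIN.
        unmatched = sets[left] ^ sets[right]
        if not unmatched:  # the NOT EXISTS filter
            result.append((left, right))
    return result

print(pairs_without_mismatch(rows))
```

Like the SQL version, this examines every pair of students, which is why the cost grows quadratically with the number of students.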

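Which to use?

I am pretty confident that the array approach will perform better in all cases. For a really large data set you can take the WITH expression, create a temporary table from that query instead, add an index on (courses, student_id), and do crazy-fast equality searches that will more than pay off the cost of creating the index. You cannot do that with the subquery join approach.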

Answer 2

select courses, group_concat(studentID) from
(select studentID,
group_concat(courseID order by courseID) as courses
from Table1 group by studentID) abc
-- keep only course lists shared by more than one student
group by courses having count(*) > 1;


Answer 3

Test case

I created a somewhat realistic test case:

CREATE TEMP TABLE student_course (
   student_id integer
  ,course_id integer
  ,PRIMARY KEY (student_id, course_id)
);

INSERT INTO student_course
SELECT *
FROM (VALUES (1, 1), (1, 2), (1, 3), (2, 1), (3, 1), (3, 2), (3, 3)) v
      -- to include some non-random values in test
UNION  ALL
SELECT DISTINCT student_id, normal_rand((random() * 30)::int, 1000, 35)::int
FROM   generate_series(4, 5000) AS student_id;

DELETE FROM student_course WHERE random() > 0.9; -- create some dead tuples
ANALYZE student_course; -- needed for temp table

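Note the use of normal_rand() to populate the dummy table with a normal distribution of values. It ships with the tablefunc module, which I am going to use further down anyway.

Also note the numbers I am going to manipulate for the benchmark to simulate various test cases (bold in the original formatting): the upper bound of generate_series(), the multiplier for random(), and the standard deviation passed to normal_rand().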
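For readers without the tablefunc module at hand, the shape of this test data can be approximated in Python. This is a rough sketch, not a reproduction: the function and parameter names are hypothetical, and Python's random generator and rounding differ from PostgreSQL's, so the exact rows will not match.

```python
import random

def make_test_rows(n_students=5000, max_courses=30, mean=1000, stddev=35, seed=0):
    """Approximate the INSERT above: fixed sample rows plus random course picks."""
    rng = random.Random(seed)
    # The non-random rows from the question.
    rows = {(1, 1), (1, 2), (1, 3), (2, 1), (3, 1), (3, 2), (3, 3)}
    for student_id in range(4, n_students + 1):
        # Counterpart of (random() * 30)::int: how many course ids to draw.
        n_draws = round(rng.random() * max_courses)
        for _ in range(n_draws):
            # Counterpart of normal_rand(n, 1000, 35)::int.
            course_id = round(rng.gauss(mean, stddev))
            # Adding to a set plays the role of SELECT DISTINCT.
            rows.add((student_id, course_id))
    return sorted(rows)

print(len(make_test_rows(n_students=100)))
```

A larger standard deviation spreads the draws over more distinct course ids, which in turn makes identical course sets rarer.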

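Plain SQL

The question is rather basic and unclear. Find the first two students with matching courses? Or find all of them? Find couples, or groups of students sharing the same courses? Craig (Answer 1) answers this interpretation:
Find all couples sharing the same courses.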

C1 - Craig's first query

Plain SQL with a CTE and grouping by an array, slightly reformatted:

WITH student_coursearray(student_id, courses) AS (
   SELECT student_id, array_agg(course_id ORDER BY course_id)
   FROM   student_course
   GROUP  BY student_id
   )
SELECT a.student_id, b.student_id
FROM   student_coursearray a
JOIN   student_coursearray b ON (a.courses = b.courses)
WHERE  a.student_id < b.student_id
ORDER  BY a.student_id, b.student_id;

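The second query in Craig's answer dropped out of the race right away. With more than just a few rows, performance deteriorates quickly. Its join on a.student_id > b.student_id is effectively a CROSS JOIN, and that is poison.

E1 - Improved version

C1 has one major weakness: ORDER BY per aggregate performs poorly. So I rewrote it with ORDER BY in a subquery: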

WITH cte AS (
   SELECT student_id, array_agg(course_id) AS courses
   FROM  (SELECT student_id, course_id FROM student_course ORDER BY 1, 2) sub
   GROUP  BY student_id
   )
SELECT a.student_id, b.student_id
FROM   cte a
JOIN   cte b USING (courses)
WHERE  a.student_id < b.student_id
ORDER  BY 1,2;

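E2 - Alternative interpretation of the question

I think the generally more useful case is:
Find all students sharing the same courses.
So I return arrays of students with matching courses.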

WITH s AS (
   SELECT student_id, array_agg(course_id) AS courses
   FROM  (SELECT student_id, course_id FROM student_course ORDER BY 1, 2) sub
   GROUP  BY student_id
   )
SELECT array_agg(student_id)
FROM   s
GROUP  BY courses
HAVING count(*) > 1
ORDER    BY array_agg(student_id);
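The difference between this interpretation and the pairwise one is easy to see in a small Python sketch: instead of comparing students two at a time, each student is filed under their course set, and any bucket with more than one student is a result. The function name is hypothetical and the data is the question's sample plus one extra row for illustration.

```python
from collections import defaultdict

# Sample rows from the question, plus a hypothetical student 4 taking only course 1.
rows = [(1, 1), (1, 2), (1, 3), (2, 1), (3, 1), (3, 2), (3, 3), (4, 1)]

def groups_with_same_courses(rows):
    """Return sorted lists of students who share exactly the same set of courses."""
    courses = defaultdict(set)
    for student_id, course_id in rows:
        courses[student_id].add(course_id)

    # GROUP BY courses: bucket students by their (hashable) course set.
    groups = defaultdict(list)
    for student_id in sorted(courses):
        groups[frozenset(courses[student_id])].append(student_id)

    # HAVING count(*) > 1, then ORDER BY the student arrays.
    return sorted(g for g in groups.values() if len(g) > 1)

print(groups_with_same_courses(rows))
```

Grouping touches each student once, so it avoids the quadratic number of result rows that the pairwise query produces when many students share a course set.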

F1 - Dynamic PL/pgSQL function

To make this generic and fast, I wrapped it into a plpgsql function with dynamic SQL:

CREATE OR REPLACE FUNCTION f_same_set(_tbl regclass, _id text, _match_id text)
  RETURNS SETOF int[] AS
$func$
BEGIN

RETURN QUERY EXECUTE format(
   $f$
   WITH s AS (
      SELECT %1$I AS id, array_agg(%2$I) AS courses
      FROM   (SELECT %1$I, %2$I FROM %3$s ORDER BY 1, 2) s
      GROUP  BY 1
      )
   SELECT array_agg(id)
   FROM   s
   GROUP  BY courses
   HAVING count(*) > 1
   ORDER    BY array_agg(id)
   $f$
   ,_id, _match_id, _tbl
   );
END
$func$  LANGUAGE plpgsql;

Call:

SELECT * FROM f_same_set('student_course', 'student_id', 'course_id');

Works for any table with numeric columns. It is trivial to extend it to other data types, too.

`crosstab()`

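For a relatively small number of courses (and an arbitrarily large number of students), crosstab(), provided by the additional tablefunc module, is another option in PostgreSQL. More general information here: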
PostgreSQL Crosstab Query

Simple case

A simple case for the simple example in the question, much like the one explained in the linked answer:

SELECT array_agg(student_id)
FROM   crosstab('
     SELECT student_id, course_id, TRUE
     FROM   student_course
     ORDER  BY 1'
   ,'VALUES (1),(2),(3)'
   )
AS t(student_id int, c1 bool, c2 bool, c3 bool)
GROUP  BY c1, c2, c3
HAVING count(*) > 1;
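What the crosstab query does can be sketched in Python as a pivot: one row per student with one flag per course, then grouping on the whole row of flags. This is an illustrative sketch with a hypothetical function name. One simplification to note: crosstab() leaves missing cells NULL, while the sketch uses False.

```python
from collections import defaultdict

# Sample rows from the question: (student_id, course_id)
rows = [(1, 1), (1, 2), (1, 3), (2, 1), (3, 1), (3, 2), (3, 3)]

def groups_via_pivot(rows):
    """Pivot to one flag per course, then group students by identical flag rows."""
    # The category list, like 'VALUES (1),(2),(3)' in the crosstab call.
    course_ids = sorted({course_id for _, course_id in rows})

    taken = defaultdict(set)
    for student_id, course_id in rows:
        taken[student_id].add(course_id)

    # One tuple of flags per student: the columns c1, c2, c3 of the crosstab.
    groups = defaultdict(list)
    for student_id in sorted(taken):
        flags = tuple(c in taken[student_id] for c in course_ids)
        groups[flags].append(student_id)

    # GROUP BY c1, c2, c3 HAVING count(*) > 1
    return [g for g in groups.values() if len(g) > 1]

print(groups_via_pivot(rows))
```

The flag row has one entry per distinct course, so the width of every row grows with the number of distinct course ids, which is the reason this variant only suits a small number of courses.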

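F2 - Dynamic crosstab function

For the simple case, the crosstab variant was faster, so I built a plpgsql function with dynamic SQL and included it in the test. It is functionally identical to F1.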

CREATE OR REPLACE FUNCTION f_same_set_x(_tbl regclass, _id text, _match_id text)
  RETURNS SETOF int[] AS
$func$
DECLARE
   _ids int[];   -- for array of match_ids (course_id in example)
BEGIN

-- Get list of match_ids
EXECUTE format(
   'SELECT array_agg(DISTINCT %1$I ORDER BY %1$I) FROM %2$s',_match_id, _tbl)
INTO _ids;

-- Main query
RETURN QUERY EXECUTE format(
   $f$
   SELECT array_agg(%1$I)
   FROM   crosstab('SELECT %1$I, %2$I, TRUE FROM %3$s ORDER BY 1'
                  ,'VALUES (%4$s)')
      AS t(student_id int, c%5$s bool)
   GROUP  BY c%6$s
   HAVING count(*) > 1
   ORDER  BY array_agg(student_id)
   $f$
   ,_id
   ,_match_id
   ,_tbl
   ,array_to_string(_ids, '),(')     -- values
   ,array_to_string(_ids, ' bool,c') -- column def list
   ,array_to_string(_ids, ',c')      -- names
   );
END
$func$  LANGUAGE plpgsql;

Call:

SELECT * FROM f_same_set_x('student_course', 'student_id', 'course_id');

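Benchmark

I tested on my small PostgreSQL test server: PostgreSQL 9.1.9 on Debian Linux, on a roughly 6-year-old AMD Opteron server. I ran 5 test sets with the above setup and each of the presented queries, taking the best of 5 runs with EXPLAIN ANALYZE.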

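I used these values for the bold numbers in the above test case to populate the table:

nr. of students / max. nr. of courses / standard deviation (a larger value results in more distinct course_ids)
   1. 1000 / 30 / 35
   2. 5000 / 30 / 50
   3. 10000 / 30 / 100
   4. 10000 / 10 / 10
   5. 10000 / 5 / 5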

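C1
1. Total runtime: 57 ms
2. Total runtime: 315 ms
3. Total runtime: 663 ms
4. Total runtime: 543 ms
5. Total runtime: 2345 ms (!) - deteriorates with many pairs

E1
1. Total runtime: 46 ms
2. Total runtime: 251 ms
3. Total runtime: 529 ms
4. Total runtime: 338 ms
5. Total runtime: 734 ms

E2
1. Total runtime: 45 ms
2. Total runtime: 245 ms
3. Total runtime: 515 ms
4. Total runtime: 218 ms
5. Total runtime: 143 ms

F1 - winner
1. Total runtime: 14 ms
2. Total runtime: 77 ms
3. Total runtime: 166 ms
4. Total runtime: 80 ms
5. Total runtime: 54 ms

F2
1. Total runtime: 62 ms
2. Total runtime: 336 ms
3. Total runtime: 1053 ms (!) - crosstab() deteriorates with many distinct values
4. Total runtime: 195 ms
5. Total runtime: 105 ms (!) - but performs well with fewer distinct values

The PL/pgSQL function with dynamic SQL, sorting rows in a subquery, is the clear winner.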

Answer 4

A naive relational division implementation, with a CTE:

WITH pairs AS (
        SELECT DISTINCT a.student_id AS aaa
        , b.student_id AS bbb
        FROM student_course a
        JOIN student_course b ON a.course_id = b.course_id
        )
SELECT *
FROM pairs p
WHERE p.aaa < p.bbb
AND NOT EXISTS (
        SELECT * FROM student_course nx1
        WHERE nx1.student_id = p.aaa
        AND NOT EXISTS (
                SELECT * FROM student_course nx2
                WHERE nx2.student_id = p.bbb
                AND nx2.course_id = nx1.course_id
                )
        )
AND NOT EXISTS (
        SELECT * FROM student_course nx1
        WHERE nx1.student_id = p.bbb
        AND NOT EXISTS (
                SELECT * FROM student_course nx2
                WHERE nx2.student_id = p.aaa
                AND nx2.course_id = nx1.course_id
                )
        )
        ;

The same, without the CTE:

SELECT *
FROM (
        SELECT DISTINCT a.student_id AS aaa
        , b.student_id AS bbb
        FROM student_course a
        JOIN student_course b ON a.course_id = b.course_id
        ) p
WHERE p.aaa < p.bbb
AND NOT EXISTS (
        SELECT * FROM student_course nx1
        WHERE nx1.student_id = p.aaa
        AND NOT EXISTS (
                SELECT * FROM student_course nx2
                WHERE nx2.student_id = p.bbb
                AND nx2.course_id = nx1.course_id
                )
        )
AND NOT EXISTS (
        SELECT * FROM student_course nx1
        WHERE nx1.student_id = p.bbb
        AND NOT EXISTS (
                SELECT * FROM student_course nx2
                WHERE nx2.student_id = p.aaa
                AND nx2.course_id = nx1.course_id
                )
        )
        ;

Obviously, the non-CTE version is faster.
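The double NOT EXISTS pattern uses only plain SQL, so it can be tried without a PostgreSQL server. The sketch below runs the non-CTE query against an in-memory SQLite database through Python's sqlite3 module. SQLite is an assumption made here for convenience, not the database the answer was written for, and the wrapper function name is hypothetical.

```python
import sqlite3

# Sample rows from the question: (student_id, course_id)
rows = [(1, 1), (1, 2), (1, 3), (2, 1), (3, 1), (3, 2), (3, 3)]

QUERY = """
SELECT p.aaa, p.bbb
FROM (
    SELECT DISTINCT a.student_id AS aaa, b.student_id AS bbb
    FROM student_course a
    JOIN student_course b ON a.course_id = b.course_id
) p
WHERE p.aaa < p.bbb
-- no course of aaa is missing from bbb
AND NOT EXISTS (
    SELECT 1 FROM student_course nx1
    WHERE nx1.student_id = p.aaa
    AND NOT EXISTS (
        SELECT 1 FROM student_course nx2
        WHERE nx2.student_id = p.bbb
        AND nx2.course_id = nx1.course_id))
-- no course of bbb is missing from aaa
AND NOT EXISTS (
    SELECT 1 FROM student_course nx1
    WHERE nx1.student_id = p.bbb
    AND NOT EXISTS (
        SELECT 1 FROM student_course nx2
        WHERE nx2.student_id = p.aaa
        AND nx2.course_id = nx1.course_id))
ORDER BY p.aaa, p.bbb
"""

def division_pairs(rows):
    """Load rows into an in-memory SQLite table and run the division query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE student_course ("
            "student_id INTEGER, course_id INTEGER, "
            "PRIMARY KEY (student_id, course_id))"
        )
        conn.executemany("INSERT INTO student_course VALUES (?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()

print(division_pairs(rows))
```

Each NOT EXISTS block checks one direction of containment; two sets are equal exactly when each contains the other, which is why both blocks are needed.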

Answer 5

A procedure to do this in MySQL:

Create table student_course_agg 
( 
student_id int,
courses varchar(150)
);

INSERT INTO student_course_agg
select studentID ,GROUP_CONCAT(courseID ORDER BY courseID) courses
FROM STUDENTS 
GROUP BY 1;

SELECT master.student_id m_student_id,child.student_id c_student_id
FROM student_course_agg master 
JOIN student_course_agg child 
    ON master.student_id<child.student_id and master.courses=child.courses;

As a direct query:

SELECT master.student_id m_student_id,child.student_id c_student_id
FROM (select studentID ,GROUP_CONCAT(courseID ORDER BY courseID) courses
FROM STUDENTS 
GROUP BY 1) master
JOIN (select studentID ,GROUP_CONCAT(courseID ORDER BY courseID) courses
FROM STUDENTS 
GROUP BY 1) child 
   ON master.studentID <child.studentID and master.courses=child.courses;
