如何以多次通过关系过滤SQL结果

时间:2011-09-09 16:54:40

标签: mysql sql postgresql sql-match-all relational-division

假设我有表studentclubstudent_club

student {
    id
    name
}
club {
    id
    name
}
student_club {
    student_id
    club_id
}

我想知道如何找到足球(30)和棒球(50)俱乐部的所有学生 虽然这个查询不起作用,但它是我迄今为止最接近的事情:

SELECT student.*
FROM   student
INNER  JOIN student_club sc ON student.id = sc.student_id
LEFT   JOIN club c ON c.id = sc.club_id
WHERE  c.id = 30 AND c.id = 50

13 个答案:

答案 0 :(得分:129)

我很好奇。众所周知,好奇心因杀猫而闻名。

那么,这是给猫皮肤最快的方法吗?

此测试的精确猫皮环境:

    Debian Squeeze上的
  • PostgreSQL 9.0 具有不错的内存和设置。
  • 6.000名学生,24.000个俱乐部会员资格(从具有现实生活数据的类似数据库中复制的数据。)
  • 从问题中的命名架构略微转移:student.id student.stud_idclub.id此处为club.club_id
  • 我在这个帖子中的作者之后命名了查询,其索引有两个。
  • 我运行了几次所有查询来填充缓存,然后我使用EXPLAIN ANALYZE选择了最好的5个。
  • 相关指数(应该是最佳的 - 只要我们缺乏哪些俱乐部将被查询的前瞻性知识):

    ALTER TABLE student ADD CONSTRAINT student_pkey PRIMARY KEY(stud_id );
    ALTER TABLE student_club ADD CONSTRAINT sc_pkey PRIMARY KEY(stud_id, club_id);
    ALTER TABLE club       ADD CONSTRAINT club_pkey PRIMARY KEY(club_id );
    CREATE INDEX sc_club_id_idx ON student_club (club_id);
    
    此处大多数查询都不需要

    club_pkey 主键在PostgreSQL中自动实现唯一索引 最后一个索引是为了弥补PostgreSQL上multi-column indexes的这个已知缺点:

  

多列B树索引可以与查询条件一起使用   涉及索引列的任何子集,但索引最多   在领先(最左边)有限制时有效   列。

结果:

EXPLAIN ANALYZE的总运行时间。

1)Martin 2:44.594 ms

SELECT s.stud_id, s.name
FROM   student s
JOIN   student_club sc USING (stud_id)
WHERE  sc.club_id IN (30, 50)
GROUP  BY 1,2
HAVING COUNT(*) > 1;

2)Erwin 1:33.217 ms

SELECT s.stud_id, s.name
FROM   student s
JOIN   (
   SELECT stud_id
   FROM   student_club
   WHERE  club_id IN (30, 50)
   GROUP  BY 1
   HAVING COUNT(*) > 1
   ) sc USING (stud_id);

<3>马丁1:31.735毫秒
SELECT s.stud_id, s.name
   FROM   student s
   WHERE  student_id IN (
   SELECT student_id
   FROM   student_club
   WHERE  club_id = 30
   INTERSECT
   SELECT stud_id
   FROM   student_club
   WHERE  club_id = 50);

4)Derek:2.287 ms

SELECT s.stud_id,  s.name
FROM   student s
WHERE  s.stud_id IN (SELECT stud_id FROM student_club WHERE club_id = 30)
AND    s.stud_id IN (SELECT stud_id FROM student_club WHERE club_id = 50);

5)Erwin 2:2.181 ms

SELECT s.stud_id,  s.name
FROM   student s
WHERE  EXISTS (SELECT * FROM student_club
               WHERE  stud_id = s.stud_id AND club_id = 30)
AND    EXISTS (SELECT * FROM student_club
               WHERE  stud_id = s.stud_id AND club_id = 50);

6)肖恩:2.043毫秒

SELECT s.stud_id, s.name
FROM   student s
JOIN   student_club x ON s.stud_id = x.stud_id
JOIN   student_club y ON s.stud_id = y.stud_id
WHERE  x.club_id = 30
AND    y.club_id = 50;

最后三个表演几乎相同。 4)和5)产生相同的查询计划。

延迟补充:

花哨的SQL,但性能跟不上。

7)ypercube 1:148.649 ms

SELECT s.stud_id,  s.name
FROM   student AS s
WHERE  NOT EXISTS (
   SELECT *
   FROM   club AS c 
   WHERE  c.club_id IN (30, 50)
   AND    NOT EXISTS (
      SELECT *
      FROM   student_club AS sc 
      WHERE  sc.stud_id = s.stud_id
      AND    sc.club_id = c.club_id  
      )
   );

8)ypercube 2:147.497 ms

SELECT s.stud_id,  s.name
FROM   student AS s
WHERE  NOT EXISTS (
   SELECT *
   FROM  (
      SELECT 30 AS club_id  
      UNION  ALL
      SELECT 50
      ) AS c
   WHERE NOT EXISTS (
      SELECT *
      FROM   student_club AS sc 
      WHERE  sc.stud_id = s.stud_id
      AND    sc.club_id = c.club_id  
      )
   );

正如所料,这两者表现几乎相同。查询计划导致表扫描,计划程序在此处找不到使用索引的方法。


9)wildplasser 1:49.849 ms

WITH RECURSIVE two AS (
   SELECT 1::int AS level
        , stud_id
   FROM   student_club sc1
   WHERE  sc1.club_id = 30
   UNION
   SELECT two.level + 1 AS level
        , sc2.stud_id
   FROM   student_club sc2
   JOIN   two USING (stud_id)
   WHERE  sc2.club_id = 50
   AND    two.level = 1
   )
SELECT s.stud_id, s.student
FROM   student s
JOIN   two USING (studid)
WHERE  two.level > 1;

花哨的SQL,CTE的不错表现。非常奇特的查询计划。
再一次,有趣的是9.1如何处理这个问题。我将尽快将此处使用的数据库集群升级到9.1。也许我会重新运行整个shebang ...


10)wildplasser 2:36.986 ms

WITH sc AS (
   SELECT stud_id
   FROM   student_club
   WHERE  club_id IN (30,50)
   GROUP  BY stud_id
   HAVING COUNT(*) > 1
   )
SELECT s.*
FROM   student s
JOIN   sc USING (stud_id);

查询2的CTE变体。令人惊讶的是,它可能会导致略有不同的查询计划与完全相同的数据。我在student上找到了顺序扫描,其中子查询变量使用了索引。


11)ypercube 3:101.482 ms

另一个晚期加入@ypercube。令人惊讶的是,有多少种方式。

SELECT s.stud_id, s.student
FROM   student s
JOIN   student_club sc USING (stud_id)
WHERE  sc.club_id = 10                 -- member in 1st club ...
AND    NOT EXISTS (
   SELECT *
   FROM  (SELECT 14 AS club_id) AS c  -- can't be excluded for missing the 2nd
   WHERE  NOT EXISTS (
      SELECT *
      FROM   student_club AS d
      WHERE  d.stud_id = sc.stud_id
      AND    d.club_id = c.club_id
      )
   )

12)erwin 3:2.377 ms

@ ypercube的11)实际上只是这个简单变​​体的扭曲扭曲的方法,但仍然缺失。表现几乎与顶级猫一样快。

SELECT s.*
FROM   student s
JOIN   student_club x USING (stud_id)
WHERE  sc.club_id = 10                 -- member in 1st club ...
AND    EXISTS (                        -- ... and membership in 2nd exists
   SELECT *
   FROM   student_club AS y
   WHERE  y.stud_id = s.stud_id
   AND    y.club_id = 14
   )

13)erwin 4:2.375 ms

很难相信,但这是另一个真正的新变种。我看到有两个以上会员资格的潜力,但它也只有两个成为顶级猫。

SELECT s.*
FROM   student AS s
WHERE  EXISTS (
   SELECT *
   FROM   student_club AS x
   JOIN   student_club AS y USING (stud_id)
   WHERE  x.stud_id = s.stud_id
   AND    x.club_id = 14
   AND    y.club_id = 10
   )

俱乐部会员资格的动态数量

换句话说:不同数量的过滤器。这个问题要求确切 两个 俱乐部会员资格。但是许多用例必须为不同的数量做好准备。

此相关后续答案中的详细讨论:

答案 1 :(得分:18)

SELECT s.*
FROM student s
INNER JOIN student_club sc_soccer ON s.id = sc_soccer.student_id
INNER JOIN student_club sc_baseball ON s.id = sc_baseball.student_id
WHERE 
 sc_baseball.club_id = 50 AND 
 sc_soccer.club_id = 30

答案 2 :(得分:10)

select *
from student
where id in (select student_id from student_club where club_id = 30)
and id in (select student_id from student_club where club_id = 50)

答案 3 :(得分:5)

如果您只想要student_id,那么:

    Select student_id
      from student_club
     where club_id in ( 30, 50 )
  group by student_id
    having count( student_id ) = 2

如果您还需要学生的姓名:

Select student_id, name
  from student s
 where exists( select *
                 from student_club sc
                where s.student_id = sc.student_id
                  and club_id in ( 30, 50 )
             group by sc.student_id
               having count( sc.student_id ) = 2 )

如果你在club_selection表中有两个以上的俱乐部,那么:

Select student_id, name
  from student s
 where exists( select *
                 from student_club sc
                where s.student_id = sc.student_id
                  and exists( select * 
                                from club_selection cs
                               where sc.club_id = cs.club_id )
             group by sc.student_id
               having count( sc.student_id ) = ( select count( * )
                                                   from club_selection ) )

答案 4 :(得分:4)

SELECT *
FROM   student
WHERE  id IN (SELECT student_id
              FROM   student_club
              WHERE  club_id = 30
              INTERSECT
              SELECT student_id
              FROM   student_club
              WHERE  club_id = 50)  

或者更通用的解决方案更容易扩展到n俱乐部,避免INTERSECT(MySQL中不可用)和INperformance of this sucks in MySQL

SELECT s.id,
       s.name
FROM   student s
       join student_club sc
         ON s.id = sc.student_id
WHERE  sc.club_id IN ( 30, 50 )
GROUP  BY s.id,
          s.name
HAVING COUNT(DISTINCT sc.club_id) = 2  

答案 5 :(得分:4)

另一个CTE。它看起来很干净,但它可能会生成与普通子查询中的groupby相同的计划。

WITH two AS (
    SELECT student_id FROM tmp.student_club
    WHERE club_id IN (30,50)
    GROUP BY student_id
    HAVING COUNT(*) > 1
    )
SELECT st.* FROM tmp.student st
JOIN two ON (two.student_id=st.id)
    ;

对于那些想要测试的人,我的生成testdata的副本是:

DROP SCHEMA tmp CASCADE;
CREATE SCHEMA tmp;

CREATE TABLE tmp.student
    ( id INTEGER NOT NULL PRIMARY KEY
    , sname VARCHAR
    );

CREATE TABLE tmp.club
    ( id INTEGER NOT NULL PRIMARY KEY
    , cname VARCHAR
    );

CREATE TABLE tmp.student_club
    ( student_id INTEGER NOT NULL  REFERENCES tmp.student(id)
    , club_id INTEGER NOT NULL  REFERENCES tmp.club(id)
    );

INSERT INTO tmp.student(id)
    SELECT generate_series(1,1000)
    ;

INSERT INTO tmp.club(id)
    SELECT generate_series(1,100)
    ;

INSERT INTO tmp.student_club(student_id,club_id)
    SELECT st.id  , cl.id
    FROM tmp.student st, tmp.club cl
    ;

DELETE FROM tmp.student_club
WHERE random() < 0.8
    ;

UPDATE tmp.student SET sname = 'Student#' || id::text ;
UPDATE tmp.club SET cname = 'Soccer' WHERE id = 30;
UPDATE tmp.club SET cname = 'Baseball' WHERE id = 50;

ALTER TABLE tmp.student_club
    ADD PRIMARY KEY (student_id,club_id)
    ;

答案 6 :(得分:3)

所以有一种方法可以给猫皮肤 我会添加两个来制作它,更完整。

1)GROUP first,稍后加入

假设(student_id, club_id)student_club 唯一的理智数据模型。 马丁史密斯的第二个版本有点类似,但他后来加入了第一组。这应该更快:

SELECT s.id, s.name
  FROM student s
  JOIN (
   SELECT student_id
     FROM student_club
    WHERE club_id IN (30, 50)
    GROUP BY 1
   HAVING COUNT(*) > 1
       ) sc USING (student_id);

2)EXISTS

当然,还有经典EXISTS。类似于Derek与IN的变体。简单快捷。 (在MySQL中,这应该比使用IN的变体快得多):

SELECT s.id, s.name
  FROM student s
 WHERE EXISTS (SELECT 1 FROM student_club
               WHERE  student_id = s.student_id AND club_id = 30)
   AND EXISTS (SELECT 1 FROM student_club
               WHERE  student_id = s.student_id AND club_id = 50);

答案 7 :(得分:3)

由于没有人添加此(经典)版本:

SELECT s.*
FROM student AS s
WHERE NOT EXISTS
      ( SELECT *
        FROM club AS c 
        WHERE c.id IN (30, 50)
          AND NOT EXISTS
              ( SELECT *
                FROM student_club AS sc 
                WHERE sc.student_id = s.id
                  AND sc.club_id = c.id  
              )
      )

或类似:

SELECT s.*
FROM student AS s
WHERE NOT EXISTS
      ( SELECT *
        FROM
          ( SELECT 30 AS club_id  
          UNION ALL
            SELECT 50
          ) AS c
        WHERE NOT EXISTS
              ( SELECT *
                FROM student_club AS sc 
                WHERE sc.student_id = s.id
                  AND sc.club_id = c.club_id  
              )
      )

再试一次略有不同的方法。灵感来自Explain Extended: Multiple attributes in a EAV table: GROUP BY vs. NOT EXISTS中的一篇文章:

SELECT s.*
FROM student_club AS sc
  JOIN student AS s
    ON s.student_id = sc.student_id
WHERE sc.club_id = 50                      --- one option here
  AND NOT EXISTS
      ( SELECT *
        FROM
          ( SELECT 30 AS club_id           --- all the rest in here
                                           --- as in previous query
          ) AS c
        WHERE NOT EXISTS
              ( SELECT *
                FROM student_club AS scc 
                WHERE scc.student_id = sc.id
                  AND scc.club_id = c.club_id  
              )
      )

另一种方法:

SELECT s.stud_id
FROM   student s

EXCEPT

SELECT stud_id
FROM 
  ( SELECT s.stud_id, c.club_id
    FROM student s 
      CROSS JOIN (VALUES (30),(50)) c (club_id)
  EXCEPT
    SELECT stud_id, club_id
    FROM student_club
    WHERE club_id IN (30, 50)   -- optional. Not needed but may affect performance
  ) x ;   

答案 8 :(得分:2)

WITH RECURSIVE two AS
    ( SELECT 1::integer AS level
    , student_id
    FROM tmp.student_club sc0
    WHERE sc0.club_id = 30
    UNION
    SELECT 1+two.level AS level
    , sc1.student_id
    FROM tmp.student_club sc1
    JOIN two ON (two.student_id = sc1.student_id)
    WHERE sc1.club_id = 50
    AND two.level=1
    )
SELECT st.* FROM tmp.student st
JOIN two ON (two.student_id=st.id)
WHERE two.level> 1

    ;

这似乎表现得相当不错,因为CTE扫描避免了对两个独立子查询的需要。

总是有理由滥用递归查询!

(顺便说一句:mysql似乎没有递归查询)

答案 9 :(得分:1)

查询2)和10)

中的不同查询计划

我在真实的数据库中进行了测试,因此这些名称与catskin列表不同。它是一个备份副本,因此在所有测试运行期间没有任何变化(除了对目录的微小更改)。

查询2)

SELECT a.*
FROM   ef.adr a
JOIN (
    SELECT adr_id
    FROM   ef.adratt
    WHERE  att_id IN (10,14)
    GROUP  BY adr_id
    HAVING COUNT(*) > 1) t using (adr_id);

Merge Join  (cost=630.10..1248.78 rows=627 width=295) (actual time=13.025..34.726 rows=67 loops=1)
  Merge Cond: (a.adr_id = adratt.adr_id)
  ->  Index Scan using adr_pkey on adr a  (cost=0.00..523.39 rows=5767 width=295) (actual time=0.023..11.308 rows=5356 loops=1)
  ->  Sort  (cost=630.10..636.37 rows=627 width=4) (actual time=12.891..13.004 rows=67 loops=1)
        Sort Key: adratt.adr_id
        Sort Method:  quicksort  Memory: 28kB
        ->  HashAggregate  (cost=450.87..488.49 rows=627 width=4) (actual time=12.386..12.710 rows=67 loops=1)
              Filter: (count(*) > 1)
              ->  Bitmap Heap Scan on adratt  (cost=97.66..394.81 rows=2803 width=4) (actual time=0.245..5.958 rows=2811 loops=1)
                    Recheck Cond: (att_id = ANY ('{10,14}'::integer[]))
                    ->  Bitmap Index Scan on adratt_att_id_idx  (cost=0.00..94.86 rows=2803 width=0) (actual time=0.217..0.217 rows=2811 loops=1)
                          Index Cond: (att_id = ANY ('{10,14}'::integer[]))
Total runtime: 34.928 ms

查询10)

WITH two AS (
    SELECT adr_id
    FROM   ef.adratt
    WHERE  att_id IN (10,14)
    GROUP  BY adr_id
    HAVING COUNT(*) > 1
    )
SELECT a.*
FROM   ef.adr a
JOIN   two using (adr_id);

Hash Join  (cost=1161.52..1261.84 rows=627 width=295) (actual time=36.188..37.269 rows=67 loops=1)
  Hash Cond: (two.adr_id = a.adr_id)
  CTE two
    ->  HashAggregate  (cost=450.87..488.49 rows=627 width=4) (actual time=13.059..13.447 rows=67 loops=1)
          Filter: (count(*) > 1)
          ->  Bitmap Heap Scan on adratt  (cost=97.66..394.81 rows=2803 width=4) (actual time=0.252..6.252 rows=2811 loops=1)
                Recheck Cond: (att_id = ANY ('{10,14}'::integer[]))
                ->  Bitmap Index Scan on adratt_att_id_idx  (cost=0.00..94.86 rows=2803 width=0) (actual time=0.226..0.226 rows=2811 loops=1)
                      Index Cond: (att_id = ANY ('{10,14}'::integer[]))
  ->  CTE Scan on two  (cost=0.00..50.16 rows=627 width=4) (actual time=13.065..13.677 rows=67 loops=1)
  ->  Hash  (cost=384.68..384.68 rows=5767 width=295) (actual time=23.097..23.097 rows=5767 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 1153kB
        ->  Seq Scan on adr a  (cost=0.00..384.68 rows=5767 width=295) (actual time=0.005..10.955 rows=5767 loops=1)
Total runtime: 37.482 ms

答案 10 :(得分:1)

@ erwin-brandstetter请以此为基准:

SELECT s.stud_id, s.name
FROM   student s, student_club x, student_club y
WHERE  x.club_id = 30
AND    s.stud_id = x.stud_id
AND    y.club_id = 50
AND    s.stud_id = y.stud_id;

这就像@sean的数字6),我想是更干净。

答案 11 :(得分:0)

-- EXPLAIN ANALYZE
WITH two AS (
    SELECT c0.student_id
    FROM tmp.student_club c0
    , tmp.student_club c1
    WHERE c0.student_id = c1.student_id
    AND c0.club_id = 30
    AND c1.club_id = 50
    )
SELECT st.* FROM tmp.student st
JOIN two ON (two.student_id=st.id)
    ;

查询计划:

 Hash Join  (cost=1904.76..1919.09 rows=337 width=15) (actual time=6.937..8.771 rows=324 loops=1)
   Hash Cond: (two.student_id = st.id)
   CTE two
     ->  Hash Join  (cost=849.97..1645.76 rows=337 width=4) (actual time=4.932..6.488 rows=324 loops=1)
           Hash Cond: (c1.student_id = c0.student_id)
           ->  Bitmap Heap Scan on student_club c1  (cost=32.76..796.94 rows=1614 width=4) (actual time=0.667..1.835 rows=1646 loops=1)
                 Recheck Cond: (club_id = 50)
                 ->  Bitmap Index Scan on sc_club_id_idx  (cost=0.00..32.36 rows=1614 width=0) (actual time=0.473..0.473 rows=1646 loops=1)                     
                       Index Cond: (club_id = 50)
           ->  Hash  (cost=797.00..797.00 rows=1617 width=4) (actual time=4.203..4.203 rows=1620 loops=1)
                 Buckets: 1024  Batches: 1  Memory Usage: 57kB
                 ->  Bitmap Heap Scan on student_club c0  (cost=32.79..797.00 rows=1617 width=4) (actual time=0.663..3.596 rows=1620 loops=1)                   
                       Recheck Cond: (club_id = 30)
                       ->  Bitmap Index Scan on sc_club_id_idx  (cost=0.00..32.38 rows=1617 width=0) (actual time=0.469..0.469 rows=1620 loops=1)
                             Index Cond: (club_id = 30)
   ->  CTE Scan on two  (cost=0.00..6.74 rows=337 width=4) (actual time=4.935..6.591 rows=324 loops=1)
   ->  Hash  (cost=159.00..159.00 rows=8000 width=15) (actual time=1.979..1.979 rows=8000 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 374kB
         ->  Seq Scan on student st  (cost=0.00..159.00 rows=8000 width=15) (actual time=0.093..0.759 rows=8000 loops=1)
 Total runtime: 8.989 ms
(20 rows)

所以它似乎仍然希望对学生进行seq扫描。

答案 12 :(得分:0)

SELECT s.stud_id, s.name
FROM   student s,
(
select x.stud_id from 
student_club x 
JOIN   student_club y ON x.stud_id = y.stud_id
WHERE  x.club_id = 30
AND    y.club_id = 50
) tmp_tbl
where tmp_tbl.stud_id = s.stud_id
;

使用最快的变体(肖恩先生在Brandstetter先生的图表中)。可能是只有一个连接的变体,只有student_club矩阵才有权生活。因此,最长的查询将只有两列要计算,想法是使查询变薄。