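Slow dependent subquery problem

Time: 2012-11-05 22:01:08

Tags: mysql sql optimization

I have a query that I'm running in MySQL. As you can see, every part of the query is on an indexed field. However, the query takes forever (tens of minutes, longer than I'm willing to wait). The Connect tables consist of two integers and two indexes (one on field one, field two; the other on field two, field one). Source and target are tables with a single indexed int field. Given all the indexes, I would expect this query to complete in a few seconds. Any suggestions on 1: why it takes so long, and 2: how to make it faster?

Thanks!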

mysql> explain 
SELECT DISTINCT geneConnect.geneSymbolID FROM SNPEffectGeneConnector AS geneConnect 
  JOIN IndelSNPEffectConnector AS snpEConnect ON geneConnect.snpEffectID = snpEConnect.snpEffectID 
  JOIN InDels2 AS source ON source.id = snpEConnect.indelID 
  WHERE geneConnect.geneSymbolID NOT IN (
    SELECT geneConnect.geneSymbolID FROM SNPEffectGeneConnector AS geneConnect 
    JOIN IndelSNPEffectConnector AS snpEConnect ON geneConnect.snpEffectID = snpEConnect.snpEffectID 
    JOIN InDels3 AS target ON target.id = snpEConnect.indelID);
+----+--------------------+-------------+-------+-------------------+----------+---------+-----------------------------------------------------------------------+------+--------------------------------+
| id | select_type        | table       | type  | possible_keys     | key      | key_len | ref                                                                   | rows | Extra                          |
+----+--------------------+-------------+-------+-------------------+----------+---------+-----------------------------------------------------------------------+------+--------------------------------+
|  1 | PRIMARY            | source      | index | id                | id       | 4       | NULL                                                                  | 5771 | Using index; Using temporary   |
|  1 | PRIMARY            | snpEConnect | ref   | snpEList          | snpEList | 4       | treattablebrowser.source.id                                           |    2 | Using index                    |
|  1 | PRIMARY            | geneConnect | ref   | snpEList          | snpEList | 4       | treattablebrowser.snpEConnect.snpEffectID                             |    1 | Using where; Using index       |
|  2 | DEPENDENT SUBQUERY | geneConnect | ref   | snpEList,geneList | geneList | 4       | func                                                                  |    1 | Using index                    |
|  2 | DEPENDENT SUBQUERY | target      | index | id                | id       | 4       | NULL                                                                  | 6297 | Using index; Using join buffer |
|  2 | DEPENDENT SUBQUERY | snpEConnect | ref   | snpEList          | snpEList | 8       | treattablebrowser.target.id,treattablebrowser.geneConnect.snpEffectID |    1 | Using index                    |
+----+--------------------+-------------+-------+-------------------+----------+---------+-----------------------------------------------------------------------+------+--------------------------------+

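6 rows in set (0.01 sec)

4 Answers:

Answer 0: (score: 3)

I suppose this is largely of academic interest now that Greg has solved the problem himself. It's good to know my intuition about these things can be completely off. Still, there are three ways I can rewrite this. I thought the first could be simplified, but as Greg pointed out, the simplification doesn't work. I'm not sure whether this will be any faster than the original, although it did produce a different query plan in my SQL Server tests.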

Select Distinct
    g1.geneSymbolID 
From
    SNPEffectGeneConnector AS g1 
        Inner Join
    IndelSNPEffectConnector AS s1 
        ON g1.snpEffectID = s1.snpEffectID 
        Inner Join
    InDels2 AS i2 ON i2.id = s1.indelID 
Where Not Exists (
    Select 'x'
        From
            SNPEffectGeneConnector As g2
                Inner Join
            IndelSNPEffectConnector AS s2 
                On g2.snpEffectID = s2.snpEffectID 
                Inner Join
            InDels3 As i3
                On i3.id = s2.indelID
        Where
            g2.geneSymbolID = g1.geneSymbolID
    );

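I'm not 100% sure about the second way, but it works on my very small amount of test data. If it works, it has a much shorter query plan (not necessarily faster, but a good indication):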

Select
  geneSymbolID
From
  SNPEffectGeneConnector As g
    Inner Join 
  IndelSNPEffectConnector As s
    ON g.snpEffectID = s.snpEffectID 
    Left Outer Join
  InDels2 As i2 
    On i2.id = s.indelID 
    Left Outer Join
  InDels3 As i3
    On i3.id = s.indelID
Group By
    geneSymbolID
Having
    count(i2.id) > 0 And
    count(i3.id) = 0

Another way (apologies for the non-descriptive aliases):

Select
    g.geneSymbolID
From
    SNPEffectGeneConnector As g
        Inner Join
    IndelSNPEffectConnector AS s
        On g.snpEffectID = s.snpEffectID 
        Inner Join (
        Select 
            i2.id,
            0 As c
        From    
            InDels2 i2
        Union All
        Select
            i3.id,
            1
        From
            InDels3 i3
    ) as i23
    on s.indelID = i23.id
Group By
    g.geneSymbolID
Having
    max(i23.c) = 0;

http://sqlfiddle.com/#!2/944e1/10
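As a sanity check on the three rewrites above, the sketch below runs them next to the original NOT IN query and compares the results. It uses Python's built-in sqlite3 module as a stand-in for MySQL, so it says nothing about MySQL query plans or speed. The rows are toy data borrowed from the test in a later answer, not the asker's real tables.

```python
import sqlite3

# Toy schema and rows (assumed test data, not the asker's real tables).
SETUP = """
CREATE TABLE InDels2 (id INTEGER);
CREATE TABLE InDels3 (id INTEGER);
CREATE TABLE IndelSNPEffectConnector (indelID INTEGER, snpEffectID INTEGER);
CREATE TABLE SNPEffectGeneConnector (geneSymbolID INTEGER, snpEffectID INTEGER);
INSERT INTO InDels2 VALUES (1);
INSERT INTO InDels3 VALUES (2);
INSERT INTO IndelSNPEffectConnector VALUES (1, 55), (2, 88), (3, 77);
INSERT INTO SNPEffectGeneConnector VALUES (100, 55), (100, 88), (42, 55), (42, 77);
"""

QUERIES = {
    # The query from the question.
    "not_in": """
        SELECT DISTINCT g1.geneSymbolID
        FROM SNPEffectGeneConnector AS g1
        JOIN IndelSNPEffectConnector AS s1 ON g1.snpEffectID = s1.snpEffectID
        JOIN InDels2 AS i2 ON i2.id = s1.indelID
        WHERE g1.geneSymbolID NOT IN (
            SELECT g2.geneSymbolID
            FROM SNPEffectGeneConnector AS g2
            JOIN IndelSNPEffectConnector AS s2 ON g2.snpEffectID = s2.snpEffectID
            JOIN InDels3 AS i3 ON i3.id = s2.indelID)""",
    # First rewrite: correlated NOT EXISTS.
    "not_exists": """
        SELECT DISTINCT g1.geneSymbolID
        FROM SNPEffectGeneConnector AS g1
        JOIN IndelSNPEffectConnector AS s1 ON g1.snpEffectID = s1.snpEffectID
        JOIN InDels2 AS i2 ON i2.id = s1.indelID
        WHERE NOT EXISTS (
            SELECT 'x'
            FROM SNPEffectGeneConnector AS g2
            JOIN IndelSNPEffectConnector AS s2 ON g2.snpEffectID = s2.snpEffectID
            JOIN InDels3 AS i3 ON i3.id = s2.indelID
            WHERE g2.geneSymbolID = g1.geneSymbolID)""",
    # Second rewrite: two LEFT JOINs, then GROUP BY / HAVING on the counts.
    "group_having": """
        SELECT g.geneSymbolID
        FROM SNPEffectGeneConnector AS g
        JOIN IndelSNPEffectConnector AS s ON g.snpEffectID = s.snpEffectID
        LEFT JOIN InDels2 AS i2 ON i2.id = s.indelID
        LEFT JOIN InDels3 AS i3 ON i3.id = s.indelID
        GROUP BY g.geneSymbolID
        HAVING COUNT(i2.id) > 0 AND COUNT(i3.id) = 0""",
    # Third rewrite: UNION ALL of both indel tables with a flag column.
    "union_max": """
        SELECT g.geneSymbolID
        FROM SNPEffectGeneConnector AS g
        JOIN IndelSNPEffectConnector AS s ON g.snpEffectID = s.snpEffectID
        JOIN (SELECT id, 0 AS c FROM InDels2
              UNION ALL
              SELECT id, 1 FROM InDels3) AS i23 ON s.indelID = i23.id
        GROUP BY g.geneSymbolID
        HAVING MAX(i23.c) = 0""",
}

conn = sqlite3.connect(":memory:")
conn.executescript(SETUP)

# Run every variant and collect the sorted gene IDs it returns.
results = {name: sorted(row[0] for row in conn.execute(sql))
           for name, sql in QUERIES.items()}
for name, genes in results.items():
    print(name, genes)
```

On these rows all four variants return only gene 42. Agreement on a handful of rows is a hint rather than a proof of equivalence, and timings would still have to be measured in MySQL itself.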

Answer 1: (score: 0)

    SELECT DISTINCT geneConnect.geneSymbolID 
    FROM SNPEffectGeneConnector AS geneConnect 
      JOIN IndelSNPEffectConnector AS snpEConnect 
          ON geneConnect.snpEffectID = snpEConnect.snpEffectID 
      JOIN InDels2 AS source ON source.id = snpEConnect.indelID
      LEFT OUTER JOIN InDels3 AS target ON target.id = snpEConnect.indelID
    WHERE target.id is null

The query above should be equivalent to yours and give you better performance.
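The equivalence claim is worth testing before relying on it. The sketch below is an assumption-laden check, using Python's sqlite3 as a stand-in for MySQL and toy rows borrowed from the test data in a later answer. The LEFT JOIN ... IS NULL form tests each indel row for absence from InDels3, whereas the original NOT IN excludes a gene if any of its indels is in InDels3, so the two can disagree when a gene reaches InDels2 and InDels3 through different indels.

```python
import sqlite3

# Toy rows (assumed): gene 100 reaches InDels2 via indel 1 and InDels3 via indel 2.
SETUP = """
CREATE TABLE InDels2 (id INTEGER);
CREATE TABLE InDels3 (id INTEGER);
CREATE TABLE IndelSNPEffectConnector (indelID INTEGER, snpEffectID INTEGER);
CREATE TABLE SNPEffectGeneConnector (geneSymbolID INTEGER, snpEffectID INTEGER);
INSERT INTO InDels2 VALUES (1);
INSERT INTO InDels3 VALUES (2);
INSERT INTO IndelSNPEffectConnector VALUES (1, 55), (2, 88), (3, 77);
INSERT INTO SNPEffectGeneConnector VALUES (100, 55), (100, 88), (42, 55), (42, 77);
"""

# The query from the question (gene-level exclusion).
ORIGINAL = """
SELECT DISTINCT g1.geneSymbolID
FROM SNPEffectGeneConnector AS g1
JOIN IndelSNPEffectConnector AS s1 ON g1.snpEffectID = s1.snpEffectID
JOIN InDels2 AS source ON source.id = s1.indelID
WHERE g1.geneSymbolID NOT IN (
    SELECT g2.geneSymbolID
    FROM SNPEffectGeneConnector AS g2
    JOIN IndelSNPEffectConnector AS s2 ON g2.snpEffectID = s2.snpEffectID
    JOIN InDels3 AS target ON target.id = s2.indelID)
"""

# The LEFT JOIN rewrite from this answer (row-level exclusion).
LEFT_JOIN = """
SELECT DISTINCT geneConnect.geneSymbolID
FROM SNPEffectGeneConnector AS geneConnect
JOIN IndelSNPEffectConnector AS snpEConnect
    ON geneConnect.snpEffectID = snpEConnect.snpEffectID
JOIN InDels2 AS source ON source.id = snpEConnect.indelID
LEFT OUTER JOIN InDels3 AS target ON target.id = snpEConnect.indelID
WHERE target.id IS NULL
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SETUP)

original_genes = sorted(row[0] for row in conn.execute(ORIGINAL))
left_join_genes = sorted(row[0] for row in conn.execute(LEFT_JOIN))
print("original :", original_genes)
print("left join:", left_join_genes)
```

On this toy data the original returns only gene 42, while the LEFT JOIN version also returns gene 100, so the rewrite looks safe only if no gene can reach both tables through different indels.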

Answer 2: (score: 0)

If I understood correctly, you want to find all the geneSymbolIDs in SNPEffectGeneConnector that have an entry in IndelSNPEffectConnector whose indelID does match (an indelID in InDels2), but that have no corresponding match in InDels3.

You could then run the first part of the query (the "does match" part) and join it further to the last part with a LEFT JOIN, thereby collecting all the genes that match. LEFT JOINing that against the table of gene symbols, the rows where the match fails are all the genes that do not meet the reverse criterion, and are therefore of interest.

Revised answer

This is the query that matches:

SELECT DISTINCT genes.geneSymbolID
FROM ( SELECT DISTINCT geneSymbolID FROM SNPEffectGeneConnector ) AS genes
JOIN SNPEffectGeneConnector AS effectSource
    ON ( genes.geneSymbolID = effectSource.geneSymbolID)
JOIN SNPEffectGeneConnector AS effectTarget
    ON ( genes.geneSymbolID = effectTarget.geneSymbolID)
JOIN IndelSNPEffectConnector AS indelSource
    ON ( indelSource.snpEffectID = effectSource.snpEffectID )
JOIN IndelSNPEffectConnector AS indelTarget
    ON ( indelTarget.snpEffectID = effectTarget.snpEffectID )
JOIN InDels2 ON ( indelSource.indelId = InDels2.id )
JOIN InDels3 ON ( indelTarget.indelId = InDels3.id );

To get the genes of interest:

SELECT glob.geneSymbolID
FROM ( SELECT DISTINCT geneSymbolID FROM SNPEffectGeneConnector ) AS glob
LEFT JOIN (
    SELECT DISTINCT genes.geneSymbolID
    FROM ( SELECT DISTINCT geneSymbolID FROM SNPEffectGeneConnector ) AS genes
    JOIN SNPEffectGeneConnector AS effectSource
        ON ( genes.geneSymbolID = effectSource.geneSymbolID)
    JOIN SNPEffectGeneConnector AS effectTarget
        ON ( genes.geneSymbolID = effectTarget.geneSymbolID)
    JOIN IndelSNPEffectConnector AS indelSource
        ON ( indelSource.snpEffectID = effectSource.snpEffectID )
    JOIN IndelSNPEffectConnector AS indelTarget
        ON ( indelTarget.snpEffectID = effectTarget.snpEffectID )
    JOIN InDels2 ON ( indelSource.indelId = InDels2.id )
    JOIN InDels3 ON ( indelTarget.indelId = InDels3.id )
) AS fits ON (glob.geneSymbolID = fits.geneSymbolID)
WHERE fits.geneSymbolID IS NULL;

Now, for this query, I think you need these indexes:

CREATE INDEX SNPEffectGeneConnector_ndx
    ON SNPEffectGeneConnector(snpEffectID, geneSymbolID);

CREATE INDEX SNPEffectGeneConnector_ndx2
    ON SNPEffectGeneConnector(geneSymbolID);

CREATE INDEX IndelSNPEffectConnector_ndx
    ON IndelSNPEffectConnector(snpEffectID, indelID);
CREATE [UNIQUE?] INDEX InDels2_ndx ON InDels2(id); -- unless id is primary key
CREATE [UNIQUE?] INDEX InDels3_ndx ON InDels3(id); -- unless id is primary key

Test

CREATE TABLE InDels2 ( id integer );
INSERT INTO InDels2 VALUES ( 1 );
CREATE TABLE InDels3 ( id integer );
INSERT INTO InDels3 VALUES ( 2 );
CREATE TABLE IndelSNPEffectConnector ( indelId integer, snpEffectID integer );
INSERT INTO IndelSNPEffectConnector VALUES ( 1, 55 ), ( 2, 88 );
CREATE TABLE SNPEffectGeneConnector ( geneSymbolID integer, snpEffectID integer );
INSERT INTO SNPEffectGeneConnector VALUES ( 100, 55 ), ( 100, 88 );

So, since gene 100 is connected to effect 55, which is connected to indel 1 and is therefore listed in InDels2, but is also connected to effect 88, which is connected to indel 2 and is therefore in InDels3, it must NOT appear.

What will appear? If I understood the requirements, we need a gene that produces an effect whose indel is not listed in InDels3. So, for example, gene 42, which causes effect 77, associated with indel 3, must appear:

INSERT INTO SNPEffectGeneConnector VALUES ( 42, 55 );
INSERT INTO SNPEffectGeneConnector VALUES ( 42, 77 );
INSERT INTO IndelSNPEffectConnector VALUES ( 3, 77 );

So the query for the genes of interest yields:

+--------------+
| geneSymbolID |
+--------------+
|           42 |
+--------------+

It is possible to check why 42 appears and 100 does not, using a modification of the first query:

SELECT genes.geneSymbolID, effectSource.snpEffectID, effectTarget.snpEffectID, indelSource.indelId AS sourceInDel, indelTarget.indelId AS targetInDel, InDels3.id
FROM ( SELECT DISTINCT geneSymbolID FROM SNPEffectGeneConnector ) AS genes
JOIN SNPEffectGeneConnector AS effectSource
    ON ( genes.geneSymbolID = effectSource.geneSymbolID)
JOIN SNPEffectGeneConnector AS effectTarget
    ON ( genes.geneSymbolID = effectTarget.geneSymbolID)
JOIN IndelSNPEffectConnector AS indelSource
    ON ( indelSource.snpEffectID = effectSource.snpEffectID )
JOIN IndelSNPEffectConnector AS indelTarget
    ON ( indelTarget.snpEffectID = effectTarget.snpEffectID )
JOIN InDels2 ON ( indelSource.indelId = InDels2.id )
LEFT JOIN InDels3 ON ( indelTarget.indelId = InDels3.id );

+--------------+-------------+-------------+-------------+-------------+------+
| geneSymbolID | snpEffectID | snpEffectID | sourceInDel | targetInDel | id   |
+--------------+-------------+-------------+-------------+-------------+------+
|           42 |          55 |          55 |           1 |           1 | NULL |
|           42 |          55 |          77 |           1 |           3 | NULL |
|          100 |          55 |          55 |           1 |           1 | NULL |
|          100 |          55 |          88 |           1 |           2 |    2 |
+--------------+-------------+-------------+-------------+-------------+------+

... 100 has one row where the InDels3 id is not NULL, and it reports the target indel 2.

Answer 3: (score: 0)

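It turns out the problem was that, although everything was indexed, the gene IDs returned by the subquery were NOT indexed. Joining against, or doing an IN search on, an unindexed set of numbers performs terribly, and that is what I was getting.

My solution was to perform the outer and inner JOINs separately, dump the results into two different indexed tables, and then delete the geneIDs in table 1 that were also in table 2.

Moral of the story: never join against, or search in, anything that isn't indexed.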
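The two-table approach described above can be sketched roughly as follows. This is a minimal illustration under assumptions, not the exact statements that were run: the table names source_genes and target_genes, the index names, and the toy rows are all made up, and Python's sqlite3 stands in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Toy schema and rows (assumed, not the real data).
CREATE TABLE InDels2 (id INTEGER);
CREATE TABLE InDels3 (id INTEGER);
CREATE TABLE IndelSNPEffectConnector (indelID INTEGER, snpEffectID INTEGER);
CREATE TABLE SNPEffectGeneConnector (geneSymbolID INTEGER, snpEffectID INTEGER);
INSERT INTO InDels2 VALUES (1);
INSERT INTO InDels3 VALUES (2);
INSERT INTO IndelSNPEffectConnector VALUES (1, 55), (2, 88), (3, 77);
INSERT INTO SNPEffectGeneConnector VALUES (100, 55), (100, 88), (42, 55), (42, 77);

-- Step 1: materialize the outer JOIN (genes reachable from InDels2) and index it.
CREATE TABLE source_genes AS
    SELECT DISTINCT g.geneSymbolID
    FROM SNPEffectGeneConnector AS g
    JOIN IndelSNPEffectConnector AS s ON g.snpEffectID = s.snpEffectID
    JOIN InDels2 AS i2 ON i2.id = s.indelID;
CREATE INDEX source_genes_ndx ON source_genes(geneSymbolID);

-- Step 2: materialize the inner JOIN (genes reachable from InDels3) and index it.
CREATE TABLE target_genes AS
    SELECT DISTINCT g.geneSymbolID
    FROM SNPEffectGeneConnector AS g
    JOIN IndelSNPEffectConnector AS s ON g.snpEffectID = s.snpEffectID
    JOIN InDels3 AS i3 ON i3.id = s.indelID;
CREATE INDEX target_genes_ndx ON target_genes(geneSymbolID);

-- Step 3: remove from table 1 every gene that also appears in table 2.
DELETE FROM source_genes
WHERE geneSymbolID IN (SELECT geneSymbolID FROM target_genes);
""")

# Whatever is left in source_genes is the answer to the original question.
remaining = [row[0] for row in conn.execute(
    "SELECT geneSymbolID FROM source_genes ORDER BY geneSymbolID")]
print(remaining)
```

On the toy rows only gene 42 is left in source_genes. In MySQL the final step could also be written as a multi-table DELETE with a JOIN between the two tables, which may be worth trying if the IN form is again planned as a dependent subquery.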