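I have a query that I am running in MySQL. As you can see from the EXPLAIN output, every part of the query is on an indexed field. However, the query takes forever (tens of minutes, longer than I am willing to wait). The Connect tables consist of two ints and two indexes (one on field one, field two; the other on field two, field one). Source and target are tables with a single indexed int field. Given all the indexes, I would expect this query to finish in a few seconds. Any suggestions on 1: why it is taking so long, and 2: how to make it faster?
Thanks!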
mysql> explain
SELECT DISTINCT geneConnect.geneSymbolID FROM SNPEffectGeneConnector AS geneConnect
JOIN IndelSNPEffectConnector AS snpEConnect ON geneConnect.snpEffectID = snpEConnect.snpEffectID
JOIN InDels2 AS source ON source.id = snpEConnect.indelID
WHERE geneConnect.geneSymbolID NOT IN (
SELECT geneConnect.geneSymbolID FROM SNPEffectGeneConnector AS geneConnect
JOIN IndelSNPEffectConnector AS snpEConnect ON geneConnect.snpEffectID = snpEConnect.snpEffectID
JOIN InDels3 AS target ON target.id = snpEConnect.indelID);
+----+--------------------+-------------+-------+-------------------+----------+---------+-----------------------------------------------------------------------+------+--------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+-------------+-------+-------------------+----------+---------+-----------------------------------------------------------------------+------+--------------------------------+
| 1 | PRIMARY | source | index | id | id | 4 | NULL | 5771 | Using index; Using temporary |
| 1 | PRIMARY | snpEConnect | ref | snpEList | snpEList | 4 | treattablebrowser.source.id | 2 | Using index |
| 1 | PRIMARY | geneConnect | ref | snpEList | snpEList | 4 | treattablebrowser.snpEConnect.snpEffectID | 1 | Using where; Using index |
| 2 | DEPENDENT SUBQUERY | geneConnect | ref | snpEList,geneList | geneList | 4 | func | 1 | Using index |
| 2 | DEPENDENT SUBQUERY | target | index | id | id | 4 | NULL | 6297 | Using index; Using join buffer |
| 2 | DEPENDENT SUBQUERY | snpEConnect | ref | snpEList | snpEList | 8 | treattablebrowser.target.id,treattablebrowser.geneConnect.snpEffectID | 1 | Using index |
+----+--------------------+-------------+-------+-------------------+----------+---------+-----------------------------------------------------------------------+------+--------------------------------+
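6 rows in set (0.01 sec)
Answer 0 (score: 3)
I guess this is largely of academic interest now that Greg has solved the problem himself. It is good to know that my intuition about these things can be so thoroughly wrong. I can still see three ways to rewrite this. I thought the first one could be simplified, but as Greg points out, the simplification does not work. I am not sure whether this will be any faster than the original, although it does produce a different query plan in my SQL Server test.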
Select Distinct
g1.geneSymbolID
From
SNPEffectGeneConnector AS g1
Inner Join
IndelSNPEffectConnector AS s1
ON g1.snpEffectID = s1.snpEffectID
Inner Join
InDels2 AS i2 ON i2.id = s1.indelID
Where Not Exists (
Select 'x'
From
SNPEffectGeneConnector As g2
Inner Join
IndelSNPEffectConnector AS s2
On g2.snpEffectID = s2.snpEffectID
Inner Join
InDels3 As i3
On i3.id = s2.indelID
Where
g2.geneSymbolID = g1.geneSymbolID
);
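I am not 100% sure about the second way, but it works on my very small amount of test data. If it works, it has a much shorter query plan (not necessarily faster, but a good indication):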
Select
geneSymbolID
From
SNPEffectGeneConnector As g
Inner Join
IndelSNPEffectConnector As s
ON g.snpEffectID = s.snpEffectID
Left Outer Join
InDels2 As i2
On i2.id = s.indelID
Left Outer Join
InDels3 As i3
On i3.id = s.indelID
Group By
geneSymbolID
Having
count(i2.id) > 0 And
count(i3.id) = 0
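The HAVING clause above relies on COUNT(column) counting only non-NULL values, so rows where a LEFT OUTER JOIN found no match contribute nothing to the count. A minimal sketch of that behaviour, using a throwaway derived table (the alias t and the column x are made up for illustration and are not part of the schema in the question):

```sql
SELECT COUNT(*) AS all_rows,       -- counts every row, NULL or not
       COUNT(x) AS non_null_rows   -- skips rows where x IS NULL
FROM (
    SELECT 1 AS x
    UNION ALL
    SELECT NULL
) AS t;
-- all_rows = 2, non_null_rows = 1
```

This is why count(i2.id) > 0 means "at least one indel of this gene is in InDels2" and count(i3.id) = 0 means "none of them is in InDels3".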
Another way (apologies for the non-descriptive aliases):
Select
g.geneSymbolID
From
SNPEffectGeneConnector As g
Inner Join
IndelSNPEffectConnector AS s
On g.snpEffectID = s.snpEffectID
Inner Join (
Select
i2.id,
0 As c
From
InDels2 i2
Union All
Select
i3.id,
1
From
InDels3 i3
) as i23
on s.indelID = i23.id
Group By
g.geneSymbolID
Having
max(i23.c) = 0;
Answer 1 (score: 0)
SELECT DISTINCT geneConnect.geneSymbolID
FROM SNPEffectGeneConnector AS geneConnect
JOIN IndelSNPEffectConnector AS snpEConnect
ON geneConnect.snpEffectID = snpEConnect.snpEffectID
JOIN InDels2 AS source ON source.id = snpEConnect.indelID
LEFT OUTER JOIN InDels3 AS target ON target.id = snpEConnect.indelID
WHERE target.id is null
The query above should be equivalent to yours and give you better performance.
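The equivalence claim is worth checking on a small data set before relying on it. The sketch below uses hypothetical scratch data (one gene, 100, linked to indel 1 in InDels2 and to indel 2 in InDels3) and assumes an empty test database, not the real tables:

```sql
-- Hypothetical scratch data for illustration only.
CREATE TABLE InDels2 (id INT);
CREATE TABLE InDels3 (id INT);
CREATE TABLE IndelSNPEffectConnector (indelID INT, snpEffectID INT);
CREATE TABLE SNPEffectGeneConnector (geneSymbolID INT, snpEffectID INT);

INSERT INTO InDels2 VALUES (1);
INSERT INTO InDels3 VALUES (2);
INSERT INTO IndelSNPEffectConnector VALUES (1, 55), (2, 88);
INSERT INTO SNPEffectGeneConnector VALUES (100, 55), (100, 88);

-- LEFT JOIN ... IS NULL version: filters row by row (per indel).
-- Returns gene 100, because indel 1 is in InDels2 and not in InDels3.
SELECT DISTINCT geneConnect.geneSymbolID
FROM SNPEffectGeneConnector AS geneConnect
JOIN IndelSNPEffectConnector AS snpEConnect
  ON geneConnect.snpEffectID = snpEConnect.snpEffectID
JOIN InDels2 AS source ON source.id = snpEConnect.indelID
LEFT OUTER JOIN InDels3 AS target ON target.id = snpEConnect.indelID
WHERE target.id IS NULL;

-- NOT IN version from the question: filters per gene.
-- Returns no rows, because gene 100 also reaches indel 2 in InDels3.
SELECT DISTINCT geneConnect.geneSymbolID
FROM SNPEffectGeneConnector AS geneConnect
JOIN IndelSNPEffectConnector AS snpEConnect
  ON geneConnect.snpEffectID = snpEConnect.snpEffectID
JOIN InDels2 AS source ON source.id = snpEConnect.indelID
WHERE geneConnect.geneSymbolID NOT IN (
    SELECT g.geneSymbolID
    FROM SNPEffectGeneConnector AS g
    JOIN IndelSNPEffectConnector AS s ON g.snpEffectID = s.snpEffectID
    JOIN InDels3 AS target ON target.id = s.indelID);
```

On this data the two queries disagree, so the results match only when no gene has indels on both sides.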
Answer 2 (score: 0)
If I understand correctly, you want to find all geneSymbolID values in SNPEffectGeneConnector that have an entry in IndelSNPEffectConnector whose indelID does match (it is in InDels2) but has no corresponding match in InDels3.
You can then run the first part of the query (the "do" part) and further join it against the last part, thereby collecting all the genes that do match. A LEFT JOIN of the gene symbol table against those matches, keeping only the rows where the match fails, will yield all the genes that do not meet the inverse criterion and are therefore of interest.
This is the matching query:
SELECT DISTINCT genes.geneSymbolID
FROM ( SELECT DISTINCT geneSymbolID FROM SNPEffectGeneConnector ) AS genes
JOIN SNPEffectGeneConnector AS effectSource
ON ( genes.geneSymbolID = effectSource.geneSymbolID)
JOIN SNPEffectGeneConnector AS effectTarget
ON ( genes.geneSymbolID = effectTarget.geneSymbolID)
JOIN IndelSNPEffectConnector AS indelSource
ON ( indelSource.snpEffectID = effectSource.snpEffectID )
JOIN IndelSNPEffectConnector AS indelTarget
ON ( indelTarget.snpEffectID = effectTarget.snpEffectID )
JOIN InDels2 ON ( indelSource.indelId = InDels2.id )
JOIN InDels3 ON ( indelTarget.indelId = InDels3.id )
;
To get the genes of interest:
SELECT glob.geneSymbolID
FROM ( SELECT DISTINCT geneSymbolID FROM SNPEffectGeneConnector ) AS glob
LEFT JOIN (
SELECT DISTINCT genes.geneSymbolID
FROM ( SELECT DISTINCT geneSymbolID FROM SNPEffectGeneConnector ) AS genes
JOIN SNPEffectGeneConnector AS effectSource
ON ( genes.geneSymbolID = effectSource.geneSymbolID)
JOIN SNPEffectGeneConnector AS effectTarget
ON ( genes.geneSymbolID = effectTarget.geneSymbolID)
JOIN IndelSNPEffectConnector AS indelSource
ON ( indelSource.snpEffectID = effectSource.snpEffectID )
JOIN IndelSNPEffectConnector AS indelTarget
ON ( indelTarget.snpEffectID = effectTarget.snpEffectID )
JOIN InDels2 ON ( indelSource.indelId = InDels2.id )
JOIN InDels3 ON ( indelTarget.indelId = InDels3.id )
) AS fits ON (glob.geneSymbolID = fits.geneSymbolID)
WHERE fits.geneSymbolID IS NULL;
Now, for this query, I think you would need these indexes:
CREATE INDEX SNPEffectGeneConnector_ndx
ON SNPEffectGeneConnector(snpEffectID, geneSymbolID);
CREATE INDEX SNPEffectGeneConnector_ndx2
ON SNPEffectGeneConnector(geneSymbolID);
CREATE INDEX IndelSNPEffectConnector_ndx
ON IndelSNPEffectConnector(snpEffectID, indelID);
-- The next two can be CREATE UNIQUE INDEX if the ids are unique; skip them if id is already the primary key.
CREATE INDEX InDels2_ndx ON InDels2(id);
CREATE INDEX InDels3_ndx ON InDels3(id);
Test data:
CREATE TABLE InDels2 ( id integer );
INSERT INTO InDels2 VALUES ( 1 );
CREATE TABLE InDels3 ( id integer );
INSERT INTO InDels3 VALUES ( 2 );
CREATE TABLE IndelSNPEffectConnector ( indelId integer, snpEffectID integer );
INSERT INTO IndelSNPEffectConnector VALUES ( 1, 55 ), ( 2, 88 );
CREATE TABLE SNPEffectGeneConnector ( geneSymbolID integer, snpEffectID integer );
INSERT INTO SNPEffectGeneConnector VALUES ( 100, 55 ), ( 100, 88 );
Gene 100 connects to effect 55, which connects to indel 1 and is therefore in InDels2, but it also connects to effect 88, which connects to indel 2 and is therefore in InDels3, so it must not appear.
What should appear? If I understand the requirements, we need a gene that produces an effect whose indel is not listed in InDels3. So, for example, gene 42, which causes effect 77, associated with indel 3, which is not in InDels3, must appear.
So, after adding:
INSERT INTO SNPEffectGeneConnector VALUES ( 42, 55 );
INSERT INTO SNPEffectGeneConnector VALUES ( 42, 77 );
INSERT INTO IndelSNPEffectConnector VALUES ( 3, 77 );
the query for the genes of interest yields
+--------------+
| geneSymbolID |
+--------------+
| 42 |
+--------------+
Why 42 appears and 100 does not can be checked with a modification of the first query:
SELECT genes.geneSymbolID, effectSource.snpEffectID, effectTarget.snpEffectID, indelSource.indelId AS sourceInDel, indelTarget.indelId AS targetInDel, InDels3.id
FROM ( SELECT DISTINCT geneSymbolID FROM SNPEffectGeneConnector ) AS genes
JOIN SNPEffectGeneConnector AS effectSource
ON ( genes.geneSymbolID = effectSource.geneSymbolID)
JOIN SNPEffectGeneConnector AS effectTarget
ON ( genes.geneSymbolID = effectTarget.geneSymbolID)
JOIN IndelSNPEffectConnector AS indelSource
ON ( indelSource.snpEffectID = effectSource.snpEffectID )
JOIN IndelSNPEffectConnector AS indelTarget
ON ( indelTarget.snpEffectID = effectTarget.snpEffectID )
JOIN InDels2 ON ( indelSource.indelId = InDels2.id )
LEFT JOIN InDels3 ON ( indelTarget.indelId = InDels3.id );
+--------------+-------------+-------------+-------------+-------------+------+
| geneSymbolID | snpEffectID | snpEffectID | sourceInDel | targetInDel | id |
+--------------+-------------+-------------+-------------+-------------+------+
| 42 | 55 | 55 | 1 | 1 | NULL |
| 42 | 55 | 77 | 1 | 3 | NULL |
| 100 | 55 | 55 | 1 | 1 | NULL |
| 100 | 55 | 88 | 1 | 2 | 2 |
+--------------+-------------+-------------+-------------+-------------+------+
... gene 100 has a row where the InDels3 id is not NULL, and that row reports target indel 2.
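Answer 3 (score: 0)
It turns out the problem was that although everything had an index, the gene IDs returned by the subquery did not. Joining against, or doing an IN search on, an unindexed set of numbers performs terribly, and that is what I was getting.
My solution was to run the outer and inner JOINs separately, dump the results into two different indexed tables, and then delete from table 1 the geneIDs that were also in table 2.
Moral of the story: never JOIN or IN against anything that is not indexed.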
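The answer gives no code for this, so the following is only a sketch of how the two-table approach might look in MySQL. The table names source_genes and target_genes are hypothetical, and temporary tables are an assumption (ordinary tables would work the same way):

```sql
-- Genes reachable from InDels2 (the outer query), stored with an index.
CREATE TEMPORARY TABLE source_genes (
    geneSymbolID INT,
    INDEX (geneSymbolID)
);
INSERT INTO source_genes (geneSymbolID)
SELECT DISTINCT g.geneSymbolID
FROM SNPEffectGeneConnector AS g
JOIN IndelSNPEffectConnector AS s ON g.snpEffectID = s.snpEffectID
JOIN InDels2 AS i2 ON i2.id = s.indelID;

-- Genes reachable from InDels3 (the former subquery), also indexed.
CREATE TEMPORARY TABLE target_genes (
    geneSymbolID INT,
    INDEX (geneSymbolID)
);
INSERT INTO target_genes (geneSymbolID)
SELECT DISTINCT g.geneSymbolID
FROM SNPEffectGeneConnector AS g
JOIN IndelSNPEffectConnector AS s ON g.snpEffectID = s.snpEffectID
JOIN InDels3 AS i3 ON i3.id = s.indelID;

-- Remove from table 1 every gene that also appears in table 2.
DELETE sg
FROM source_genes AS sg
JOIN target_genes AS tg ON tg.geneSymbolID = sg.geneSymbolID;

-- What remains is the answer to the original query.
SELECT geneSymbolID FROM source_genes;
```

Each step joins only indexed columns, which is the point of the moral above.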