geneHomology
============
id genome_name gene_id homolog_genome_name homolog_gene_id consider_homolog
1 HomoSap 1007 MusMus 824 1
2 HomoSap 1007 MusMus 825 1
3 HomoSap 1007 MusMus 826 1
4 HomoSap 2890 EColi 2140 1
...
gene
====
genome_name gene_id gene_category
MusMus 823 Upregulated
MusMus 824 Downregulated
MusMus 825 Normal
MusMus 826 Normal
MusMus 827 Upregulated
EColi 2140 Normal
...
consider_homolog is an enum (0, 1). genome_name and gene_id together form the primary key of gene. geneHomology is very large: about 200M rows.
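My goal is to count, for every gene in gene, how many homologs it has in each gene_category.
For example, based on the data above, HomoSap 1007 has 2 Normal homologs and 1 Downregulated homolog.
So my query is: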
SELECT a.id,a.genome_name,a.gene_id,a.homolog_genome_name,a.homolog_gene_id,COUNT(b.gene_category)
FROM geneHomology a,gene b
WHERE a.consider_homolog='1' AND a.homolog_genome_name=b.genome_name AND a.homolog_gene_id=b.gene_id
GROUP BY a.genome_name,a.gene_id,b.gene_category;
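It never returns (I waited patiently for more than an hour).
I have indexed gene_category in gene.
I'm new to MySQL, but I have root access to the database, so I can do whatever you suggest (carefully...). I'll be happy to provide more information.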
Update
Here is the EXPLAIN output for the query:
+----+-------------+-------+------+-----------------------+----------------------+---------+----------------------------------------------------------+---------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+-----------------------+----------------------+---------+----------------------------------------------------------+---------+---------------------------------+
| 1 | SIMPLE | b | ALL | PRIMARY,gene_genome | NULL | NULL | NULL | 1560695 | Using temporary; Using filesort |
| 1 | SIMPLE | a | ref | geneHomologyHit_gene | geneHomologyHit_gene | 54 | my_db_v71.b.gene_id,my_db_v71.b.genome_name | 13 | Using where |
+----+-------------+-------+------+-----------------------+----------------------+---------+----------------------------------------------------------+---------+---------------------------------+
Update 2
mysql> SHOW INDEX FROM gene;
+-------+------------+--------------------------+--------------+---------------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+-------+------------+--------------------------+--------------+---------------------+-----------+-------------+----------+--------+------+------------+---------+
| gene | 0 | PRIMARY | 1 | gene_id | A | NULL | NULL | NULL | | BTREE | |
| gene | 0 | PRIMARY | 2 | genome_name | A | 1560695 | NULL | NULL | | BTREE | |
| gene | 1 | gene_organism | 1 | taxon_id | A | 392 | NULL | NULL | | BTREE | |
| gene | 1 | gene_genome | 1 | genome_name | A | 853 | NULL | NULL | | BTREE | |
| gene | 1 | gene_gene_category | 1 | gene_category | A | 5 | NULL | NULL | | BTREE | |
+-------+------------+--------------------------+--------------+---------------------+-----------+-------------+----------+--------+------+------------+---------+
5 rows in set (0.01 sec)
Update 3
mysql> SHOW INDEX FROM geneHomology;
+--------------+------------+------------------------+--------------+--------------------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+--------------+------------+------------------------+--------------+--------------------------+-----------+-------------+----------+--------+------+------------+---------+
| geneHomology | 0 | PRIMARY | 1 | id | A | 680326661 | NULL | NULL | | BTREE | |
| geneHomology | 1 | geneHomologyQuery_gene | 1 | gene_id | A | 1498516 | NULL | NULL | | BTREE | |
| geneHomology | 1 | geneHomologyQuery_gene | 2 | genome_name | A | 1505147 | NULL | NULL | | BTREE | |
| geneHomology | 1 | geneHomologyHit_gene | 1 | homolog_gene_id | A | 52332820 | NULL | NULL | | BTREE | |
| geneHomology | 1 | geneHomologyHit_gene | 2 | homolog_genome_name | A | 52332820 | NULL | NULL | | BTREE | |
+--------------+------------+------------------------+--------------+--------------------------+-----------+-------------+----------+--------+------+------------+---------+
5 rows in set (0.00 sec)
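Update 4
Is there a way to get only partial results, just to see whether I'm getting what I want? I tried LIMIT 1000 and even LIMIT 10, but that doesn't seem to change anything.
Update 5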
mysql> SHOW CREATE TABLE geneHomology;
+--------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table |
+--------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| geneHomology | CREATE TABLE `geneHomology` (
`id` bigint(20) NOT NULL auto_increment,
`genome_name` varchar(20) NOT NULL,
`gene_id` varchar(30) NOT NULL,
`homolog_genome_name` varchar(20) NOT NULL,
`homolog_gene_id` varchar(30) NOT NULL,
`homolog_length` bigint(20) unsigned NOT NULL,
`significance` double unsigned NOT NULL,
`bit_score` double unsigned NOT NULL,
`percent_identity` double unsigned NOT NULL,
`start_match` int(10) unsigned NOT NULL,
`end_match` int(10) unsigned NOT NULL,
`start_match_percent` double unsigned NOT NULL,
`end_match_percent` double unsigned NOT NULL,
`strand` enum('+','-') default NULL,
`homolog_start_match` int(10) unsigned NOT NULL,
`homolog_end_match` int(10) unsigned NOT NULL,
`homolog_start_match_percent` double unsigned NOT NULL,
`homolog_end_match_percent` double unsigned NOT NULL,
`homolog_strand` enum('+','-') default NULL,
`consider_gene_homology` enum('0','1') NOT NULL,
`reason_not_considered` varchar(50) default NULL,
`num_hsps` int(10) unsigned NOT NULL,
`homology_type` varchar(2) NOT NULL,
PRIMARY KEY (`id`),
KEY `geneHomologygene` (`gene_id`,`genome_name`),
KEY `geneHomologyhomolog_gene` (`homolog_gene_id`,`homolog_genome_name`)
) ENGINE=MyISAM AUTO_INCREMENT=680326662 DEFAULT CHARSET=latin1 |
+--------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
mysql> SHOW CREATE TABLE gene;
+-------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table |
+-------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| gene | CREATE TABLE `gene` (
`taxon_id` int(10) unsigned NOT NULL,
`genome_name` varchar(20) NOT NULL,
`gene_id` varchar(30) NOT NULL,
`symbol` varchar(30) default NULL,
`type` varchar(30) default NULL,
`product` varchar(300) default NULL,
`strand` enum('+','-') NOT NULL,
`start` bigint(20) unsigned NOT NULL,
`end` bigint(20) unsigned NOT NULL,
`gene_category` enum('Upregulated','Downregulated','Normal','n/a') NOT NULL,
`consider_gene` enum('0','1') NOT NULL,
`reason_not_considered` varchar(50) default NULL,
`sequence` longblob NOT NULL,
`additional_info` varchar(300) default NULL,
PRIMARY KEY (`gene_id`,`genome_name`),
KEY `gene_organism` (`taxon_id`),
KEY `gene_genome` (`genome_name`),
KEY `gene_gene_category` (`gene_category`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 |
+-------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
Answer 0 (score: 1)
SELECT a.genome_name, a.gene_id,
cats.gene_category,
(
SELECT COUNT(*)
FROM geneHomology ab
JOIN gene b
ON b.genome_name = ab.homolog_genome_name
AND b.gene_id = ab.homolog_gene_id
WHERE ab.genome_name = a.genome_name
AND ab.gene_id = a.gene_id
AND b.gene_category = cats.gene_category
) cx
FROM gene a
CROSS JOIN
(
SELECT 'Normal' AS gene_category
UNION ALL
SELECT 'Upregulated' AS gene_category
UNION ALL
SELECT 'Downregulated' AS gene_category
) cats
LIMIT 100
This removes the filesort from your plan.
If you have a table that contains all possible gene categories, replace cats with it.
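As a rough sketch of that last suggestion: the lookup table below, gene_categories, is hypothetical and not part of the posted schema. Its ENUM definition is copied from the gene_category column in the SHOW CREATE TABLE gene output. The rest of the query is the one above with cats swapped for the table.

```sql
-- Hypothetical lookup table holding one row per category (an assumption,
-- not something that exists in the schema shown in the question).
CREATE TABLE gene_categories (
  gene_category ENUM('Upregulated','Downregulated','Normal','n/a') NOT NULL,
  PRIMARY KEY (gene_category)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

INSERT INTO gene_categories (gene_category)
VALUES ('Upregulated'), ('Downregulated'), ('Normal'), ('n/a');

-- Same query as above, with the inline UNION ALL replaced by the table.
SELECT a.genome_name, a.gene_id,
       cats.gene_category,
       (
       SELECT COUNT(*)
       FROM geneHomology ab
       JOIN gene b
       ON b.genome_name = ab.homolog_genome_name
       AND b.gene_id = ab.homolog_gene_id
       WHERE ab.genome_name = a.genome_name
       AND ab.gene_id = a.gene_id
       AND b.gene_category = cats.gene_category
       ) cx
FROM gene a
CROSS JOIN gene_categories cats
LIMIT 100;
```

Note that this table also lists 'n/a', which the inline version leaves out; drop that row if you do not want counts for it.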
Answer 1 (score: 0)
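Based on what you have posted here, I would recommend a little denormalization: put gene_category into geneHomology. You can then get rid of the join entirely and create an index on consider_homolog plus the GROUP BY fields.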
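A minimal sketch of what that could look like, under some assumptions: homolog_gene_category and the index name are made up for illustration, and the filter column is written as consider_gene_homology because that is the name in the SHOW CREATE TABLE output (the question's sample calls it consider_homolog). On a table this size, the ALTER TABLE and the backfill will each take a long time, so try it on a copy first.

```sql
-- Hypothetical new column that stores the homolog's category directly.
ALTER TABLE geneHomology
  ADD COLUMN homolog_gene_category
    ENUM('Upregulated','Downregulated','Normal','n/a') NULL;

-- One-off backfill: copy each homolog's category over from gene.
UPDATE geneHomology a
JOIN gene b
  ON b.gene_id = a.homolog_gene_id
 AND b.genome_name = a.homolog_genome_name
SET a.homolog_gene_category = b.gene_category;

-- Index on the filter column followed by the GROUP BY columns,
-- so the count can be answered from the index alone.
CREATE INDEX geneHomology_category_counts
  ON geneHomology (consider_gene_homology, genome_name, gene_id, homolog_gene_category);

-- The join is gone; the query now reads only geneHomology.
SELECT genome_name, gene_id, homolog_gene_category, COUNT(*) AS homolog_count
FROM geneHomology
WHERE consider_gene_homology = '1'
GROUP BY genome_name, gene_id, homolog_gene_category;
```

The trade-off is the usual one for denormalization: whenever gene.gene_category changes, the copied values in geneHomology have to be refreshed as well.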
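Answer 2 (score: 0)
First, remove the references to genome_name from the WHERE part of the query. If gene.gene_id and gene.genome_name are each unique, there is an obvious functional dependency here, which muddies the issue somewhat, and a numeric/numeric join will be slightly more efficient than a text/text join.
Looking at the plan, it implies you already have an index on geneHomology.hit_gene_id. If that is the case, there is not much room to make the query faster without schema changes. However, the key length of 54 suggests that the index contains a lot that should not be there. Reducing it to hit_gene_id and consider_homolog would help performance, but the limiting factor is that, unless there are other functional dependencies, there seems to be no way to avoid a full table scan of gene.
How long does "SELECT * FROM gene" take to complete? How many records are there in geneHomology?
It looks like geneHomology decomposes an N:M relationship between gene and gene (itself) and attaches a label to it.
If the number of distinct values in homolog_genome_name is relatively small, you could consider folding it into gene using a bitmap field. Or perhaps normalize the relationship into a set of 1:1 mappings. Alternatively, you could enumerate the homolog clusters.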