I have a fairly complex SELECT statement built on the Chado schema, and it runs rather inefficiently. With my current subset of the data the query takes about 5 seconds. The full dataset could be more than a hundred times larger, so I'm worried the run time will become prohibitively slow. I've been advised to add indexes to improve performance, but I'm not entirely sure what that would involve.
My query:
SELECT dbxref.accession, dbxrefprop.value, contact.name, biomaterial.description, bp1.value AS isolation_source, bp2.value AS specimen_collection_date,
bp3.value AS collection_location_name, bp4.value AS genotype, refeat.seqlen, string_agg(feat.name, ', ' order by feat.name) AS tranlation_type, refeat.residues
FROM featureloc
INNER JOIN feature srcfeat ON srcfeat.feature_id = featureloc.srcfeature_id
INNER JOIN feature feat ON feat.feature_id = featureloc.feature_id
RIGHT JOIN dbxref ON dbxref.dbxref_id = srcfeat.dbxref_id
INNER JOIN feature refeat ON refeat.dbxref_id = dbxref.dbxref_id
INNER JOIN dbxrefprop ON dbxrefprop.dbxref_id = dbxref.dbxref_id
INNER JOIN biomaterial ON biomaterial.dbxref_id = dbxref.dbxref_id
INNER JOIN biomaterialprop bp1 ON (bp1.biomaterial_id = biomaterial.biomaterial_id and bp1.type_id = 2916)
INNER JOIN biomaterialprop bp2 ON (bp2.biomaterial_id = biomaterial.biomaterial_id and bp2.type_id = 2917)
INNER JOIN biomaterialprop bp3 ON (bp3.biomaterial_id = biomaterial.biomaterial_id and bp3.type_id = 2918)
INNER JOIN biomaterialprop bp4 ON (bp4.biomaterial_id = biomaterial.biomaterial_id and bp4.type_id = 2919)
INNER JOIN contact ON contact.contact_id = biomaterial.biosourceprovider_id
GROUP BY dbxref.accession, dbxrefprop.value, contact.name, biomaterial.description, bp1.value, bp2.value, bp3.value, bp4.value, refeat.seqlen, refeat.residues
HAVING (bp1.value = 'Alveolar Macrophage')
ORDER BY dbxref.accession;
EXPLAIN output (without the HAVING line):
GroupAggregate (cost=627.81..631.98 rows=98 width=361)
  Group Key: dbxref.accession, dbxrefprop.value, contact.name, biomaterial.description, bp1.value, bp2.value, bp3.value, bp4.value, refeat.seqlen, refeat.residues
  ->  Sort (cost=627.81..628.06 rows=98 width=361)
        Sort Key: dbxref.accession, dbxrefprop.value, contact.name, biomaterial.description, bp1.value, bp2.value, bp3.value, bp4.value, refeat.seqlen, refeat.residues
        ->  Hash Join (cost=11.42..624.57 rows=98 width=361)
              Hash Cond: (biomaterial.biosourceprovider_id = contact.contact_id)
              ->  Nested Loop (cost=3.11..614.92 rows=98 width=316)
                    ->  Nested Loop (cost=2.83..563.01 rows=98 width=344)
                          ->  Nested Loop (cost=2.54..511.11 rows=98 width=332)
                                ->  Nested Loop (cost=2.26..459.20 rows=98 width=320)
                                      ->  Nested Loop Left Join (cost=1.98..407.30 rows=98 width=308)
                                            ->  Nested Loop (cost=1.12..148.51 rows=98 width=309)
                                                  ->  Merge Join (cost=0.84..88.36 rows=164 width=312)
                                                        Merge Cond: (refeat.dbxref_id = dbxrefprop.dbxref_id)
                                                        ->  Merge Join (cost=0.56..188.11 rows=1400 width=296)
                                                              Merge Cond: (refeat.dbxref_id = biomaterial.dbxref_id)
                                                              ->  Index Scan using feature_idx1 on feature refeat (cost=0.29..884.42 rows=11936 width=264)
                                                              ->  Index Scan using biomaterial_idx3 on biomaterial (cost=0.28..63.28 rows=1400 width=32)
                                                        ->  Index Scan using dbxrefprop_idx1 on dbxrefprop (cost=0.28..60.28 rows=1400 width=16)
                                                  ->  Index Scan using dbxref_pkey on dbxref (cost=0.29..0.36 rows=1 width=21)
                                                        Index Cond: (dbxref_id = refeat.dbxref_id)
                                            ->  Nested Loop (cost=0.86..2.63 rows=1 width=15)
                                                  ->  Nested Loop (cost=0.57..2.06 rows=1 width=16)
                                                        ->  Index Scan using feature_idx1 on feature srcfeat (cost=0.29..0.46 rows=1 width=16)
                                                              Index Cond: (dbxref.dbxref_id = dbxref_id)
                                                        ->  Index Scan using featureloc_idx2 on featureloc (cost=0.29..1.14 rows=46 width=16)
                                                              Index Cond: (srcfeature_id = srcfeat.feature_id)
                                                  ->  Index Scan using feature_pkey on feature feat (cost=0.29..0.56 rows=1 width=15)
                                                        Index Cond: (feature_id = featureloc.feature_id)
                                      ->  Index Scan using biomaterialprop_c1 on biomaterialprop bp1 (cost=0.28..0.52 rows=1 width=12)
                                            Index Cond: ((biomaterial_id = biomaterial.biomaterial_id) AND (type_id = 2916))
                                ->  Index Scan using biomaterialprop_c1 on biomaterialprop bp2 (cost=0.28..0.52 rows=1 width=12)
                                      Index Cond: ((biomaterial_id = bp1.biomaterial_id) AND (type_id = 2917))
                          ->  Index Scan using biomaterialprop_c1 on biomaterialprop bp3 (cost=0.28..0.52 rows=1 width=12)
                                Index Cond: ((biomaterial_id = bp1.biomaterial_id) AND (type_id = 2918))
                    ->  Index Scan using biomaterialprop_c1 on biomaterialprop bp4 (cost=0.28..0.52 rows=1 width=12)
                          Index Cond: ((biomaterial_id = bp1.biomaterial_id) AND (type_id = 2919))
              ->  Hash (cost=5.36..5.36 rows=236 width=61)
                    ->  Seq Scan on contact (cost=0.00..5.36 rows=236 width=61)
Based on my limited understanding of how to interpret EXPLAIN output, the following lines look suspicious:
-> Index Scan using feature_idx1 on feature refeat (cost=0.29..884.42 rows=11936 width=264)
-> Index Scan using biomaterial_idx3 on biomaterial (cost=0.28..63.28 rows=1400 width=32)
-> Index Scan using dbxrefprop_idx1 on dbxrefprop (cost=0.28..60.28 rows=1400 width=16)
This demonstration (in MySQL) suggests that indexing amounts to declaring additional primary keys so that lookup joins become more efficient. Here are my proposed changes:
ALTER TABLE feature
ADD PRIMARY KEY (dbxref_id);
ALTER TABLE biomaterial
ADD PRIMARY KEY (dbxref_id);
ALTER TABLE dbxrefprop
ADD PRIMARY KEY (dbxref_id);
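(A note of caution on the above: PostgreSQL allows only one PRIMARY KEY per table, so if these tables already have primary keys, these ALTER statements will be rejected, and a plain CREATE INDEX on the foreign-key column would be the usual alternative. Also, the plan above already reports index scans via feature_idx1, biomaterial_idx3 and dbxrefprop_idx1 on these join columns. A sketch for checking what already exists, using the standard pg_indexes system view:)

```sql
-- Sketch: list the indexes that already exist on the three tables
-- before adding new ones; pg_indexes is a standard PostgreSQL view.
-- If an index on dbxref_id is missing, the usual fix would be e.g.:
--   CREATE INDEX feature_dbxref_id_idx ON feature (dbxref_id);
SELECT tablename, indexname, indexdef
FROM pg_indexes
WHERE tablename IN ('feature', 'biomaterial', 'dbxrefprop')
ORDER BY tablename, indexname;
```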
Note that dbxref_id is a foreign key in all three of these tables, referencing the primary key of the dbxref table. Is this a valid way to improve the run time? Alternatively, rather than altering the tables, which lines of the query could be changed to improve it further? The inner join on the feature table under the alias "refeat" is necessary so that features not linked through the featureloc table are not dropped.
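One change I have considered (but not yet benchmarked) is moving the filter on bp1.value from HAVING to WHERE: since bp1.value is a plain grouping column rather than an aggregate, the two forms should be equivalent, and the WHERE form filters rows before grouping. A sketch of just the affected lines:

```sql
-- Sketch (untested): bp1.value is a grouped column, not an aggregate,
-- so the HAVING condition can be applied before aggregation instead:
...
INNER JOIN contact ON contact.contact_id = biomaterial.biosourceprovider_id
WHERE bp1.value = 'Alveolar Macrophage'
GROUP BY dbxref.accession, dbxrefprop.value, contact.name, biomaterial.description,
         bp1.value, bp2.value, bp3.value, bp4.value, refeat.seqlen, refeat.residues
ORDER BY dbxref.accession;
```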
EDIT1
The primary key of each joined table is as follows: featureloc = featureloc_id, feature = feature_id, dbxref = dbxref_id, dbxrefprop = dbxrefprop_id, biomaterial = biomaterial_id, biomaterialprop = biomaterialprop_id, contact = contact_id.
With a sample size of 1400, the row counts of the tables are as follows: featureloc = 10536, feature = 11936, dbxref = 15492, dbxrefprop = 1400, biomaterial = 1400, biomaterialprop = 5600, contact = 236. Note that some tables (dbxref) contain preloaded data.
EDIT2
EXPLAIN (ANALYZE, BUFFERS) output:
GroupAggregate (cost=522.10..526.26 rows=98 width=361) (actual time=7899.696..10201.445 rows=1400 loops=1)
  Group Key: dbxref.accession, dbxrefprop.value, contact.name, biomaterial.description, bp1.value, bp2.value, bp3.value, bp4.value, refeat.seqlen, refeat.residues
  Buffers: shared hit=320702 read=1752
  ->  Sort (cost=522.10..522.34 rows=98 width=361) (actual time=7899.664..7940.350 rows=10606 loops=1)
        Sort Key: dbxref.accession, dbxrefprop.value, contact.name, biomaterial.description, bp1.value, bp2.value, bp3.value, bp4.value, refeat.seqlen, refeat.residues
        Sort Method: quicksort Memory: 3708kB
        Buffers: shared hit=244651 read=1752
        ->  Hash Join (cost=11.42..518.86 rows=98 width=361) (actual time=0.406..5525.245 rows=10606 loops=1)
              Hash Cond: (biomaterial.biosourceprovider_id = contact.contact_id)
              Buffers: shared hit=171364 read=847
              ->  Nested Loop (cost=3.11..509.20 rows=98 width=316) (actual time=0.141..5201.920 rows=10606 loops=1)
                    Buffers: shared hit=171362 read=846
                    ->  Nested Loop (cost=2.83..457.29 rows=98 width=344) (actual time=0.138..4258.617 rows=10606 loops=1)
                          Buffers: shared hit=139485 read=821
                          ->  Nested Loop (cost=2.54..405.39 rows=98 width=332) (actual time=0.135..3082.229 rows=10606 loops=1)
                                Buffers: shared hit=107577 read=803
                                ->  Nested Loop (cost=2.26..353.48 rows=98 width=320) (actual time=0.130..2179.420 rows=10606 loops=1)
                                      Buffers: shared hit=75654 read=786
                                      ->  Nested Loop Left Join (cost=1.98..301.58 rows=98 width=308) (actual time=0.102..1566.105 rows=10606 loops=1)
                                            Buffers: shared hit=43748 read=773
                                            ->  Nested Loop (cost=1.12..148.51 rows=98 width=309) (actual time=0.042..332.126 rows=1400 loops=1)
                                                  Buffers: shared hit=4283 read=96
                                                  ->  Merge Join (cost=0.84..88.36 rows=164 width=312) (actual time=0.024..208.163 rows=1400 loops=1)
                                                        Merge Cond: (refeat.dbxref_id = dbxrefprop.dbxref_id)
                                                        Buffers: shared hit=78 read=93
                                                        ->  Merge Join (cost=0.56..188.11 rows=1400 width=296) (actual time=0.016..165.490 rows=1400 loops=1)
                                                              Merge Cond: (refeat.dbxref_id = biomaterial.dbxref_id)
                                                              Buffers: shared hit=73 read=81
                                                              ->  Index Scan using feature_idx1 on feature refeat (cost=0.29..884.42 rows=11936 width=264) (actual time=0.008..1.471 rows=1401 loops=1)
                                                                    Buffers: shared hit=68 read=66
                                                              ->  Index Scan using biomaterial_idx3 on biomaterial (cost=0.28..63.28 rows=1400 width=32) (actual time=0.005..118.944 rows=1400 loops=1)
                                                                    Buffers: shared hit=5 read=15
                                                        ->  Index Scan using dbxrefprop_idx1 on dbxrefprop (cost=0.28..60.28 rows=1400 width=16) (actual time=0.005..1.018 rows=1400 loops=1)
                                                              Buffers: shared hit=5 read=12
                                                  ->  Index Scan using dbxref_pkey on dbxref (cost=0.29..0.36 rows=1 width=21) (actual time=0.057..0.058 rows=1 loops=1400)
                                                        Index Cond: (dbxref_id = refeat.dbxref_id)
                                                        Buffers: shared hit=4205 read=3
                                            ->  Nested Loop (cost=0.86..1.55 rows=1 width=15) (actual time=0.119..0.848 rows=8 loops=1400)
                                                  Buffers: shared hit=39465 read=677
                                                  ->  Nested Loop (cost=0.57..1.01 rows=1 width=16) (actual time=0.061..0.511 rows=8 loops=1400)
                                                        Buffers: shared hit=8314 read=162
                                                        ->  Index Scan using feature_idx1 on feature srcfeat (cost=0.29..0.46 rows=1 width=16) (actual time=0.057..0.110 rows=1 loops=1400)
                                                              Index Cond: (dbxref.dbxref_id = dbxref_id)
                                                              Buffers: shared hit=4203 read=3
                                                        ->  Index Scan using featureloc_idx2 on featureloc (cost=0.29..0.48 rows=8 width=16) (actual time=0.002..0.088 rows=8 loops=1400)
                                                              Index Cond: (srcfeature_id = srcfeat.feature_id)
                                                              Buffers: shared hit=4111 read=159
                                                  ->  Index Scan using feature_pkey on feature feat (cost=0.29..0.53 rows=1 width=15) (actual time=0.021..0.028 rows=1 loops=10536)
                                                        Index Cond: (feature_id = featureloc.feature_id)
                                                        Buffers: shared hit=31151 read=515
                                      ->  Index Scan using biomaterialprop_c1 on biomaterialprop bp1 (cost=0.28..0.52 rows=1 width=12) (actual time=0.041..0.042 rows=1 loops=10606)
                                            Index Cond: ((biomaterial_id = biomaterial.biomaterial_id) AND (type_id = 2916))
                                            Buffers: shared hit=31906 read=13
                                ->  Index Scan using biomaterialprop_c1 on biomaterialprop bp2 (cost=0.28..0.52 rows=1 width=12) (actual time=0.065..0.077 rows=1 loops=10606)
                                      Index Cond: ((biomaterial_id = bp1.biomaterial_id) AND (type_id = 2917))
                                      Buffers: shared hit=31923 read=17
                          ->  Index Scan using biomaterialprop_c1 on biomaterialprop bp3 (cost=0.28..0.52 rows=1 width=12) (actual time=0.042..0.050 rows=1 loops=10606)
                                Index Cond: ((biomaterial_id = bp1.biomaterial_id) AND (type_id = 2918))
                                Buffers: shared hit=31908 read=18
                    ->  Index Scan using biomaterialprop_c1 on biomaterialprop bp4 (cost=0.28..0.52 rows=1 width=12) (actual time=0.027..0.031 rows=1 loops=10606)
                          Index Cond: ((biomaterial_id = bp1.biomaterial_id) AND (type_id = 2919))
                          Buffers: shared hit=31877 read=25
              ->  Hash (cost=5.36..5.36 rows=236 width=61) (actual time=0.254..0.254 rows=236 loops=1)
                    Buckets: 1024 Batches: 1 Memory Usage: 31kB
                    Buffers: shared hit=2 read=1
                    ->  Seq Scan on contact (cost=0.00..5.36 rows=236 width=61) (actual time=0.003..0.129 rows=236 loops=1)
                          Buffers: shared hit=2 read=1
Planning time: 160.551 ms
Execution time: 10275.793 ms
The current table structure can be found here. I have not made any changes to the schema.