I have a query that does what I want on a truncated dataset, but when I run it on the full dataset (millions of rows) it takes forever.
I have two tables: microsat_table and coverage_table.
microsat_table:
+----+----------+-----------+---------+-------------------------------------------------+
| id | Seq_Name | SSR_Start | SSR_End | Sequence                                        |
+----+----------+-----------+---------+-------------------------------------------------+
|  2 | chr2L    |     11050 |   11067 | TTTAATTTAATTTAATTT                              |
|  3 | chr2L    |     44173 |   44187 | TATGTATGTATGTAT                                 |
|  5 | chr2L    |     54431 |   54477 | ATAATAATATAATATAATATAATATAATATATAATAATATAATAATA |
|  6 | chr2L    |     57571 |   57594 | ATATATATATATATATATATATAT                        |
|  7 | chr2L    |     72439 |   72453 | CATACATACATACAT                                 |
|  8 | chr2L    |     74028 |   74042 | ATACATACATACATA                                 |
|  9 | chr2L    |     85573 |   85587 | ATTTTATTTTATTTT                                 |
| 10 | chr2L    |     92429 |   92443 | ACATACATACATACA                                 |
| 11 | chr2L    |    138132 |  138166 | TATATAGATATATAAATATATATATATATATATAT             |
| 13 | chr2L    |    162245 |  162259 | ATACATACATACATA                                 |
+----+----------+-----------+---------+-------------------------------------------------+
coverage_table:
+----------+-------+-------+----------+
| Seq_Name | Start | Stop  | Coverage |
+----------+-------+-------+----------+
| chr2L    |  5716 |  5771 |        1 |
| chr2L    |  8730 |  8824 |        1 |
| chr2L    |  9894 |  9948 |        1 |
| chr2L    | 19391 | 19491 |        1 |
| chr2L    | 19575 | 19675 |        1 |
| chr2L    | 19773 | 19776 |        1 |
| chr2L    | 19776 | 19872 |        2 |
| chr2L    | 21920 | 21959 |        1 |
| chr2L    | 21959 | 22020 |        2 |
| chr2L    | 22020 | 22059 |        1 |
+----------+-------+-------+----------+
I want to add a column to microsat_table that holds the average Coverage (from coverage_table) over all coverage_table rows whose Start and Stop values fall within the SSR_Start and SSR_End values of the microsat_table row.
Example result:
+-----+----------+-----------+---------+--------------------------------+---------+
| id  | Seq_Name | SSR_Start | SSR_End | Sequence                       | avg     |
+-----+----------+-----------+---------+--------------------------------+---------+
|  53 | chr2L    |    402489 |  402503 | AAAACAAAACAAAAC                |  3.0000 |
|  64 | chr2L    |    447214 |  447233 | CAGCAGCAGCAGCAGCAGCA           |  8.0000 |
|  66 | chr2L    |    457839 |  457868 | CAGCAGCAGCAACAGCAGCAGCAGGCAGCA |  2.0000 |
| 105 | chr2L    |    579589 |  579603 | TCGAATCGAATCGAA                | 11.0000 |
| 123 | chr2L    |    628484 |  628501 | TAATGTTAATGTTAATGT             |  6.0000 |
+-----+----------+-----------+---------+--------------------------------+---------+
My query is:
UPDATE microsat_table
JOIN
    (SELECT m.id, SUM(p.Coverage)/count(p.Start) AS avg
     FROM microsat_table m
     LEFT OUTER JOIN coverage_table p
     ON m.Seq_Name LIKE p.Seq_Name
     WHERE m.Seq_Name LIKE p.Seq_Name
     GROUP BY m.id) AS qt
ON microsat_table.id = qt.id
SET microsat_table.avg = qt.avg;
EXPLAIN output for the truncated tables: [not preserved in the source]
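I added indexes (I tried both HASH and BTREE indexes), which improved things a lot, but I let the query run on the larger dataset for 1.5 days and it still had not finished.
Does anyone have suggestions for making it run faster?
Thanks!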
Answer 0: (score: 1)
There are a few relatively minor inefficiencies in your code. The biggest problem, however, is that although you say you want "the average Coverage (from coverage_table) over all rows whose Start and Stop values fall within the SSR_Start and SSR_End values in microsat_table", your query does not actually seem to restrict itself to those rows. Instead, you are only matching on Seq_Name.
The code below attempts to address that (my use of >= and <= may not be what you need), along with a few other smaller things:
[code block not preserved in the source]
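Since the answer's own code did not survive, here is a minimal sketch of what a range-restricted version of the asker's update could look like. It is not the answerer's original code. It assumes MySQL, the column names shown in the question, and that "falls within" means full containment; the alias avg_cov is made up for illustration.

```sql
-- Sketch only: assumes MySQL and the question's column names.
-- A coverage row counts only if it lies entirely inside the SSR range.
UPDATE microsat_table AS m
JOIN (
    SELECT ms.id, AVG(c.Coverage) AS avg_cov   -- avg_cov is a hypothetical alias
    FROM microsat_table AS ms
    JOIN coverage_table AS c
      ON  c.Seq_Name = ms.Seq_Name             -- '=' instead of LIKE so an index can be used
      AND c.Start   >= ms.SSR_Start
      AND c.Stop    <= ms.SSR_End
    GROUP BY ms.id
) AS qt ON qt.id = m.id
SET m.avg = qt.avg_cov;
```

Because the inner join drops microsatellites without any matching coverage rows, their avg value is left untouched. If overlapping rather than contained coverage intervals should count, the two range conditions would become c.Start <= ms.SSR_End AND c.Stop >= ms.SSR_Start.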
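Answer 1: (score: 0)
Perhaps updating the table in one big transaction is too much for the system? (How large is the table you are updating?) You could try doing it in blocks. I also went for a simple subselect here, which reads more easily, IMHO.
Also note Steve Lovell's remark that your query does not seem to take the Start/Stop columns into account. Since you probably just forgot them, I have added them here as well; removing them again should not be too hard =)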
-- T-SQL (SQL Server) syntax; in MySQL the loop would have to live in a stored procedure
DECLARE @min_id int,
        @max_id int,
        @blocksize int
SELECT @min_id = MIN(id),
       @max_id = MAX(id),
       @blocksize = 100000 -- adapt as needed
  FROM microsat_table
WHILE @min_id <= @max_id
    BEGIN
        UPDATE microsat_table
           SET microsat_table.avg = (SELECT SUM(p.Coverage)/COUNT(p.Start) AS avg
                                       FROM microsat_table m
                                       LEFT OUTER JOIN coverage_table p
                                         ON m.Seq_Name LIKE p.Seq_Name -- if possible use '=' here instead of LIKE
                                        AND p.Start >= m.SSR_Start -- flagrantly "stolen" from Steve Lovell's answer
                                        AND p.Stop <= m.SSR_End -- coverage_table calls this column Stop, not End
                                      WHERE m.id = microsat_table.id)
         -- limit update to this block:
         WHERE microsat_table.id BETWEEN @min_id AND (@min_id + @blocksize - 1)
        -- prepare for next block
        SELECT @min_id = @min_id + @blocksize
    END
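The block above is written in T-SQL, while the question's UPDATE ... JOIN syntax and HASH/BTREE indexes suggest MySQL. As a rough sketch of the same block-wise idea in MySQL, the loop could be wrapped in a stored procedure. The procedure name update_avg_in_blocks and its parameter are hypothetical, and containment via >= / <= is again an assumption.

```sql
-- Sketch only: MySQL stored procedure that updates microsat_table.avg in id blocks.
-- update_avg_in_blocks and blocksize are made-up names for illustration.
DELIMITER //
CREATE PROCEDURE update_avg_in_blocks(IN blocksize INT)
BEGIN
    DECLARE min_id INT;
    DECLARE max_id INT;

    SELECT MIN(id), MAX(id) INTO min_id, max_id FROM microsat_table;

    WHILE min_id <= max_id DO
        UPDATE microsat_table AS m
        SET m.avg = (SELECT AVG(c.Coverage)
                     FROM coverage_table AS c
                     WHERE c.Seq_Name = m.Seq_Name
                       AND c.Start   >= m.SSR_Start
                       AND c.Stop    <= m.SSR_End)
        WHERE m.id BETWEEN min_id AND min_id + blocksize - 1;

        -- move on to the next block of ids
        SET min_id = min_id + blocksize;
    END WHILE;
END //
DELIMITER ;

CALL update_avg_in_blocks(100000);  -- adapt the block size as needed
```

With autocommit enabled, each UPDATE inside the loop commits on its own, which is the point of working in blocks. Rows with no matching coverage get NULL here, since the scalar subquery returns NULL when nothing matches.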
You will probably want a primary key on the id column of microsat_table and an index on the Seq_Name column plus the range column(s) of coverage_table.
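As a hedged illustration of that advice in MySQL: the index name below is invented, the choice of Start and Stop as the extra indexed columns is an assumption, and Seq_Name is assumed to be a VARCHAR (a TEXT column would need a prefix length).

```sql
-- Sketch only: skip the first statement if id is already the primary key.
ALTER TABLE microsat_table ADD PRIMARY KEY (id);

-- Composite index so the join can filter on Seq_Name first, then on the range columns.
-- idx_cov_seq_start_stop is a hypothetical name.
CREATE INDEX idx_cov_seq_start_stop
    ON coverage_table (Seq_Name, Start, Stop);
```

An index like this only helps when the join compares Seq_Name with = rather than LIKE between two columns.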