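Is there any way to optimize this MySQL query (UPDATE, multiple joins)?

Time: 2017-04-22 04:34:00

Tags: mysql sql query-optimization

I have a query that does what I want on a truncated dataset, but when I run it on the full dataset (millions of rows) it takes forever.

I have two tables: microsat_table and coverage_table.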

microsat_table:

+----+----------+-----------+---------+-------------------------------------------------+
| id | Seq_Name | SSR_Start | SSR_End | Sequence                                        |
+----+----------+-----------+---------+-------------------------------------------------+
|  2 | chr2L    |     11050 |   11067 | TTTAATTTAATTTAATTT                              |
|  3 | chr2L    |     44173 |   44187 | TATGTATGTATGTAT                                 |
|  5 | chr2L    |     54431 |   54477 | ATAATAATATAATATAATATAATATAATATATAATAATATAATAATA |
|  6 | chr2L    |     57571 |   57594 | ATATATATATATATATATATATAT                        |
|  7 | chr2L    |     72439 |   72453 | CATACATACATACAT                                 |
|  8 | chr2L    |     74028 |   74042 | ATACATACATACATA                                 |
|  9 | chr2L    |     85573 |   85587 | ATTTTATTTTATTTT                                 |
| 10 | chr2L    |     92429 |   92443 | ACATACATACATACA                                 |
| 11 | chr2L    |    138132 |  138166 | TATATAGATATATAAATATATATATATATATATAT             |
| 13 | chr2L    |    162245 |  162259 | ATACATACATACATA                                 |
+----+----------+-----------+---------+-------------------------------------------------+

coverage_table:

+----------+-------+-------+----------+
| Seq_Name | Start | Stop  | Coverage |
+----------+-------+-------+----------+
| chr2L    |  5716 |  5771 |        1 |
| chr2L    |  8730 |  8824 |        1 |
| chr2L    |  9894 |  9948 |        1 |
| chr2L    | 19391 | 19491 |        1 |
| chr2L    | 19575 | 19675 |        1 |
| chr2L    | 19773 | 19776 |        1 |
| chr2L    | 19776 | 19872 |        2 |
| chr2L    | 21920 | 21959 |        1 |
| chr2L    | 21959 | 22020 |        2 |
| chr2L    | 22020 | 22059 |        1 |
+----------+-------+-------+----------+

I would like to add a column to microsat_table that holds the average Coverage (from coverage_table) over all coverage_table rows whose Start and Stop values fall within the SSR_Start and SSR_End values of the microsat_table row.

Example result:

+-----+----------+-----------+---------+--------------------------------+---------+
| id  | Seq_Name | SSR_Start | SSR_End | Sequence                       | avg     |
+-----+----------+-----------+---------+--------------------------------+---------+
|  53 | chr2L    |    402489 |  402503 | AAAACAAAACAAAAC                |  3.0000 |
|  64 | chr2L    |    447214 |  447233 | CAGCAGCAGCAGCAGCAGCA           |  8.0000 |
|  66 | chr2L    |    457839 |  457868 | CAGCAGCAGCAACAGCAGCAGCAGGCAGCA |  2.0000 |
| 105 | chr2L    |    579589 |  579603 | TCGAATCGAATCGAA                | 11.0000 |
| 123 | chr2L    |    628484 |  628501 | TAATGTTAATGTTAATGT             |  6.0000 |
+-----+----------+-----------+---------+--------------------------------+---------+

My query is:

UPDATE microsat_table
JOIN 
   (SELECT m.id, SUM(p.Coverage)/count(p.Start) 
      AS avg FROM microsat_table m  
      LEFT OUTER JOIN coverage_table p 
      ON m.Seq_Name LIKE p.Seq_Name 
      WHERE m.Seq_Name LIKE p.Seq_Name GROUP BY m.id) AS qt 
ON microsat_table.id = qt.id 
SET microsat_table.avg = qt.avg; 

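I added indexes (trying both HASH and BTREE indexes), which improved things considerably, but I let the query run on the larger dataset for 1.5 days and it still had not finished.

Does anyone have suggestions on how to make it run faster?

Thanks!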

2 answers:

Answer 0 (score: 1)

There are a few relatively minor inefficiencies in your code. The biggest problem, however, is that although you say you want to "calculate the average coverage (from coverage_table) for all rows where the Start and Stop values in the coverage table fall within the SSR_Start and SSR_End values in microsat_table", you do not actually seem to restrict the query to do that. Instead, you only code a match on Seq_Name.

The code below attempts to address this (my use of >= and <= may not be exactly what you need), along with some other, much smaller things:
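A minimal sketch of what such a range-restricted update could look like in MySQL. It is an illustration built from the column names in the question (Seq_Name, Start, Stop, Coverage, SSR_Start, SSR_End), not necessarily the code originally posted with this answer. It assumes Seq_Name can be compared with = rather than LIKE, and that Coverage is never NULL, so that AVG(p.Coverage) matches the question's SUM(p.Coverage)/COUNT(p.Start).

```sql
-- Sketch only: average the coverage rows that lie entirely inside each
-- microsatellite's [SSR_Start, SSR_End] range, then write it back.
UPDATE microsat_table AS m
JOIN (
    SELECT m2.id,
           AVG(p.Coverage) AS avg_cov      -- avg_cov is an illustrative alias
    FROM microsat_table AS m2
    JOIN coverage_table AS p
      ON p.Seq_Name = m2.Seq_Name          -- assumption: exact match, not LIKE
     AND p.Start   >= m2.SSR_Start         -- adjust >= / <= if the bounds
     AND p.Stop    <= m2.SSR_End           -- should be exclusive
    GROUP BY m2.id
) AS qt
  ON qt.id = m.id
SET m.`avg` = qt.avg_cov;
```

Because the derived table uses an inner join, microsat_table rows with no matching coverage rows are not touched and keep their current avg value.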

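Answer 1 (score: 0)

Maybe updating the table in one big transaction is too much for the system? (How big is the table you are updating?) You could try doing it in blocks. I also went for a simple subselect here, which reads more easily IMHO.

Also note Steve Lovell's comment that your query does not seem to care about the Start/Stop columns. Since you may simply have forgotten that by accident, I have added it here too; removing it should not be too hard =)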

DECLARE @min_id int,
        @max_id int,
        @blocksize int

SELECT @min_id = MIN(id),
       @max_id = MAX(id),
       @blocksize = 100000 -- adapt as needed
  FROM microsat_table

WHILE @min_id <= @max_id
    BEGIN

        UPDATE microsat_table
           SET microsat_table.avg = ((SELECT SUM(p.Coverage)/count(p.Start) AS avg 
                                        FROM microsat_table m  
                                        LEFT OUTER JOIN coverage_table p 
                                                     ON m.Seq_Name LIKE p.Seq_Name -- if possible use '=' here instead of LIKE
                                                    AND p.Start >= m.SSR_Start -- flagrantly "stolen" from Steve Lovell's answer
                                                    AND p.Stop  <= m.SSR_End   -- the coverage_table column is named Stop
                                       WHERE m.id = microsat_table.id) 
        -- limit update to this block:
         WHERE microsat_table.id BETWEEN @min_id AND (@min_id + @blocksize - 1)

        -- prepare for next block
        SELECT @min_id = @min_id + @blocksize
    END
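The block above uses @-prefixed variables declared with DECLARE and a bare WHILE ... BEGIN ... END loop. MySQL accepts DECLARE and WHILE only inside a stored program, and it does not allow a subquery to select from the table being updated. The same block-wise idea could be sketched for MySQL as a stored procedure. This is a sketch under assumptions: the procedure name update_microsat_avg_in_blocks and its blocksize parameter are made up for illustration, Seq_Name is compared with =, and AVG(p.Coverage) stands in for SUM/COUNT.

```sql
DELIMITER //

-- Hypothetical procedure: updates microsat_table.avg one id range at a time.
CREATE PROCEDURE update_microsat_avg_in_blocks(IN blocksize INT)
BEGIN
    DECLARE cur_id INT;
    DECLARE max_id INT;

    SELECT MIN(id), MAX(id) INTO cur_id, max_id FROM microsat_table;

    -- If the table is empty both values are NULL and the loop is skipped.
    WHILE cur_id <= max_id DO
        UPDATE microsat_table AS m
        SET m.`avg` = (
            -- Correlated subquery reads only coverage_table, so the
            -- target table is not selected from inside the UPDATE.
            SELECT AVG(p.Coverage)
            FROM coverage_table AS p
            WHERE p.Seq_Name = m.Seq_Name
              AND p.Start   >= m.SSR_Start
              AND p.Stop    <= m.SSR_End
        )
        -- limit the update to the current block of ids
        WHERE m.id BETWEEN cur_id AND cur_id + blocksize - 1;

        SET cur_id = cur_id + blocksize;
    END WHILE;
END //

DELIMITER ;

CALL update_microsat_avg_in_blocks(100000);  -- adapt the block size as needed
```

With autocommit enabled, each block's UPDATE commits on its own, which is the point of splitting the work. Rows with no matching coverage rows get avg set to NULL in this variant.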

You will probably want a primary key on the id field of microsat_table and an index on coverage_table covering Seq_name plus the start column.
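A minimal sketch of such index definitions in MySQL. It assumes microsat_table does not already have a primary key, and that the range columns are Start and Stop as in the question. The index name idx_coverage_seq_start_stop is made up for illustration, and including Stop in the index is an additional assumption.

```sql
-- Only needed if id is not already the primary key; fails if one exists.
ALTER TABLE microsat_table
    ADD PRIMARY KEY (id);

-- Composite index so the join can seek on Seq_Name and then range-scan Start.
CREATE INDEX idx_coverage_seq_start_stop
    ON coverage_table (Seq_Name, Start, Stop);
```

Running EXPLAIN on the SELECT part of the query before and after adding the index shows whether the optimizer actually uses it.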