Updating millions of records with an inner-joined subquery - optimization techniques

Time: 2013-12-16 00:59:22

Tags: mysql sql optimization mysql-5.6

I'm looking for some advice on how I might better optimize this query.

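For each _piece_detail record that:

  1. Contains at least one matching _scan record on (zip, zip_4, zip_delivery_point, serial_number)
  2. Belongs to a company from mailing_groups (through a chain of relations)
  3. Has either:
    1. a first_scan_date_time that is greater than the MIN(scan_date_time) of the related _scan records
    2. a latest_scan_date_time that is less than the MAX(scan_date_time) of the related _scan records

I need to:

  1. Set _piece_detail.first_scan_date_time to MIN(_scan.scan_date_time)
  2. Set _piece_detail.latest_scan_date_time to MAX(_scan.scan_date_time)

Since I'm dealing with millions of records, I'm trying to reduce the number of records I actually have to search through. Here are some facts about the data:

  1. The _piece_detail table is partitioned by job_id, so it seems to make the most sense to run these checks in order of _piece_detail.job_id, _piece_detail.piece_id.
  2. The scan records table now contains over 100,000,000 records and is keyed by (zip, zip_4, zip_delivery_point, serial_number, scan_date_time), which is the same key used to match a _scan with a _piece_detail (aside from scan_date_time).
  3. Only about 40% of _piece_detail records belong to a mailing_group, but we don't know which ones until we run through the full chain of joins.
  4. Only about 30% of _scan records belong to a _piece_detail with a mailing_group.
  5. There are typically between 0 and 4 _scan records for each _piece_detail record.

Now, I'm trying to find a way to execute this in a decent way. I originally started with something like this: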

        UPDATE _piece_detail
            INNER JOIN (
                SELECT _piece_detail.job_id, _piece_detail.piece_id, MIN(_scan.scan_date_time) as first_scan_date_time, MAX(_scan.scan_date_time) as latest_scan_date_time
                FROM _piece_detail
                    INNER JOIN _container_quantity 
                        ON _piece_detail.cqt_database_id = _container_quantity.cqt_database_id 
                        AND _piece_detail.job_id = _container_quantity.job_id
                    INNER JOIN _container_summary 
                        ON _container_quantity.container_id = _container_summary.container_id 
                        AND _container_summary.job_id = _container_quantity.job_id
                    INNER JOIN _mail_piece_unit 
                        ON _container_quantity.mpu_id = _mail_piece_unit.mpu_id 
                        AND _container_quantity.job_id = _mail_piece_unit.job_id
                    INNER JOIN _header 
                        ON _header.job_id = _piece_detail.job_id
                    INNER JOIN mailing_groups 
                        ON _mail_piece_unit.mpu_company = mailing_groups.mpu_company
                    INNER JOIN _scan
                        ON _scan.zip = _piece_detail.zip 
                        AND _scan.zip_4 = _piece_detail.zip_4 
                        AND _scan.zip_delivery_point = _piece_detail.zip_delivery_point 
                        AND _scan.serial_number = _piece_detail.serial_number 
                GROUP BY _piece_detail.job_id, _piece_detail.piece_id, _scan.zip, _scan.zip_4, _scan.zip_delivery_point, _scan.serial_number
            ) as t1 ON _piece_detail.job_id = t1.job_id AND _piece_detail.piece_id = t1.piece_id 
        SET _piece_detail.first_scan_date_time = t1.first_scan_date_time, _piece_detail.latest_scan_date_time = t1.latest_scan_date_time
        WHERE _piece_detail.first_scan_date_time < t1.first_scan_date_time 
            OR _piece_detail.latest_scan_date_time > t1.latest_scan_date_time;
        

I think this may be trying to load too much into memory at once, and it may not be using the indexes properly.

Then I thought I could avoid that huge joined subquery and instead add two LEFT JOIN subqueries to get the min/max, like so:

        UPDATE _piece_detail
            INNER JOIN _container_quantity 
                ON _piece_detail.cqt_database_id = _container_quantity.cqt_database_id 
                AND _piece_detail.job_id = _container_quantity.job_id
            INNER JOIN _container_summary 
                ON _container_quantity.container_id = _container_summary.container_id 
                AND _container_summary.job_id = _container_quantity.job_id
            INNER JOIN _mail_piece_unit 
                ON _container_quantity.mpu_id = _mail_piece_unit.mpu_id 
                AND _container_quantity.job_id = _mail_piece_unit.job_id
            INNER JOIN _header 
                ON _header.job_id = _piece_detail.job_id
            INNER JOIN mailing_groups 
                ON _mail_piece_unit.mpu_company = mailing_groups.mpu_company
            LEFT JOIN _scan fs ON (fs.zip, fs.zip_4, fs.zip_delivery_point, fs.serial_number) = (
                SELECT zip, zip_4, zip_delivery_point, serial_number
                FROM _scan
                WHERE zip = _piece_detail.zip 
                    AND zip_4 = _piece_detail.zip_4 
                    AND zip_delivery_point = _piece_detail.zip_delivery_point 
                    AND serial_number = _piece_detail.serial_number
                ORDER BY scan_date_time ASC
                LIMIT 1
                )
            LEFT JOIN _scan ls ON (ls.zip, ls.zip_4, ls.zip_delivery_point, ls.serial_number) = (
                SELECT zip, zip_4, zip_delivery_point, serial_number
                FROM _scan
                WHERE zip = _piece_detail.zip 
                    AND zip_4 = _piece_detail.zip_4 
                    AND zip_delivery_point = _piece_detail.zip_delivery_point 
                    AND serial_number = _piece_detail.serial_number
                ORDER BY scan_date_time DESC
                LIMIT 1
                )
        SET _piece_detail.first_scan_date_time = fs.scan_date_time, _piece_detail.latest_scan_date_time = ls.scan_date_time
        WHERE _piece_detail.first_scan_date_time < fs.scan_date_time 
            OR _piece_detail.latest_scan_date_time > ls.scan_date_time
        

These are the EXPLAIN results when I convert them to SELECT statements:

        +----+-------------+---------------------+--------+----------------------------------------------------+---------------+---------+------------------------------------------------------------------------------------------------------------------------+--------+----------------------------------------------+
        | id | select_type | table               | type   | possible_keys                                      | key           | key_len | ref                                                                                                                    | rows   | Extra                                        |
        +----+-------------+---------------------+--------+----------------------------------------------------+---------------+---------+------------------------------------------------------------------------------------------------------------------------+--------+----------------------------------------------+
        |  1 | PRIMARY     | <derived2>          | ALL    | NULL                                               | NULL          | NULL    | NULL                                                                                                                   | 844161 | NULL                                         |
        |  1 | PRIMARY     | _piece_detail       | eq_ref | PRIMARY,first_scan_date_time,latest_scan_date_time | PRIMARY       | 18      | t1.job_id,t1.piece_id                                                                                                  |      1 | Using where                                  |
        |  2 | DERIVED     | _header             | index  | PRIMARY                                            | date_prepared | 3       | NULL                                                                                                                   |     87 | Using index; Using temporary; Using filesort |
        |  2 | DERIVED     | _piece_detail       | ref    | PRIMARY,cqt_database_id,zip                        | PRIMARY       | 10      | odms._header.job_id                                                                                                    |   9703 | NULL                                         |
        |  2 | DERIVED     | _container_quantity | eq_ref | unique,mpu_id,job_id,job_id_container_quantity     | unique        | 14      | odms._header.job_id,odms._piece_detail.cqt_database_id                                                                 |      1 | NULL                                         |
        |  2 | DERIVED     | _mail_piece_unit    | eq_ref | PRIMARY,company,job_id_mail_piece_unit             | PRIMARY       | 14      | odms._container_quantity.mpu_id,odms._header.job_id                                                                    |      1 | Using where                                  |
        |  2 | DERIVED     | mailing_groups      | eq_ref | PRIMARY                                            | PRIMARY       | 27      | odms._mail_piece_unit.mpu_company                                                                                      |      1 | Using index                                  |
        |  2 | DERIVED     | _container_summary  | eq_ref | unique,container_id,job_id_container_summary       | unique        | 14      | odms._header.job_id,odms._container_quantity.container_id                                                              |      1 | Using index                                  |
        |  2 | DERIVED     | _scan               | ref    | PRIMARY                                            | PRIMARY       | 28      | odms._piece_detail.zip,odms._piece_detail.zip_4,odms._piece_detail.zip_delivery_point,odms._piece_detail.serial_number |      1 | Using index                                  |
        +----+-------------+---------------------+--------+----------------------------------------------------+---------------+---------+------------------------------------------------------------------------------------------------------------------------+--------+----------------------------------------------+
        
        +----+--------------------+---------------------+--------+--------------------------------------------------------------------+---------------+---------+------------------------------------------------------------------------------------------------------------------------+-----------+-----------------------------------------------------------------+
        | id | select_type        | table               | type   | possible_keys                                                      | key           | key_len | ref                                                                                                                    | rows      | Extra                                                           |
        +----+--------------------+---------------------+--------+--------------------------------------------------------------------+---------------+---------+------------------------------------------------------------------------------------------------------------------------+-----------+-----------------------------------------------------------------+
        |  1 | PRIMARY            | _header             | index  | PRIMARY                                                            | date_prepared | 3       | NULL                                                                                                                   |        87 | Using index                                                     |
        |  1 | PRIMARY            | _piece_detail       | ref    | PRIMARY,cqt_database_id,first_scan_date_time,latest_scan_date_time | PRIMARY       | 10      | odms._header.job_id                                                                                                    |      9703 | NULL                                                            |
        |  1 | PRIMARY            | _container_quantity | eq_ref | unique,mpu_id,job_id,job_id_container_quantity                     | unique        | 14      | odms._header.job_id,odms._piece_detail.cqt_database_id                                                                 |         1 | NULL                                                            |
        |  1 | PRIMARY            | _mail_piece_unit    | eq_ref | PRIMARY,company,job_id_mail_piece_unit                             | PRIMARY       | 14      | odms._container_quantity.mpu_id,odms._header.job_id                                                                    |         1 | Using where                                                     |
        |  1 | PRIMARY            | mailing_groups      | eq_ref | PRIMARY                                                            | PRIMARY       | 27      | odms._mail_piece_unit.mpu_company                                                                                      |         1 | Using index                                                     |
        |  1 | PRIMARY            | _container_summary  | eq_ref | unique,container_id,job_id_container_summary                       | unique        | 14      | odms._header.job_id,odms._container_quantity.container_id                                                              |         1 | Using index                                                     |
        |  1 | PRIMARY            | fs                  | index  | NULL                                                               | updated       | 1       | NULL                                                                                                                   | 102462928 | Using where; Using index; Using join buffer (Block Nested Loop) |
        |  1 | PRIMARY            | ls                  | index  | NULL                                                               | updated       | 1       | NULL                                                                                                                   | 102462928 | Using where; Using index; Using join buffer (Block Nested Loop) |
        |  3 | DEPENDENT SUBQUERY | _scan               | ref    | PRIMARY                                                            | PRIMARY       | 28      | odms._piece_detail.zip,odms._piece_detail.zip_4,odms._piece_detail.zip_delivery_point,odms._piece_detail.serial_number |         1 | Using where; Using index; Using filesort                        |
        |  2 | DEPENDENT SUBQUERY | _scan               | ref    | PRIMARY                                                            | PRIMARY       | 28      | odms._piece_detail.zip,odms._piece_detail.zip_4,odms._piece_detail.zip_delivery_point,odms._piece_detail.serial_number |         1 | Using where; Using index; Using filesort                        |
        +----+--------------------+---------------------+--------+--------------------------------------------------------------------+---------------+---------+------------------------------------------------------------------------------------------------------------------------+-----------+-----------------------------------------------------------------+
        

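Now, looking at the EXPLAIN output each one generates, I really can't tell which gives me the best bang for my buck. The first one shows fewer total rows when multiplying the rows column, but the second seems to execute a bit quicker.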

Is there anything I could do to achieve the same results while increasing performance by modifying the query structure?

4 answers:

Answer 0: (score: 1)

Disable index updates while performing the bulk update:

ALTER TABLE _piece_detail DISABLE KEYS;

UPDATE ....;

ALTER TABLE _piece_detail ENABLE KEYS;

See the MySQL documentation: http://dev.mysql.com/doc/refman/5.0/en/alter-table.html

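Edit: After reviewing the MySQL documentation I pointed to, I see that it describes this for MyISAM tables and is unclear about other table types. More solutions here: How to disable index in innodb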
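Before relying on this, it is worth confirming which storage engine the table uses, since DISABLE KEYS only stops updates to nonunique indexes on MyISAM tables and has no effect on InnoDB. A minimal check, assuming the schema is named odms as the EXPLAIN output in the question suggests:

```sql
-- Look up the storage engine of the table being updated.
-- 'odms' is taken from the ref column of the question's EXPLAIN output.
SELECT TABLE_NAME, ENGINE
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'odms'
  AND TABLE_NAME = '_piece_detail';
```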

Answer 1: (score: 1)

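There is something I was taught and have followed strictly to this day: create as many temporary tables as you want while avoiding derived tables, especially in the case of UPDATE/DELETE/INSERT statements, because:

  1. You cannot predict the indexes on a derived table
  2. A derived table may not stay in memory if the result set is large
  3. The table (MyISAM) or rows (InnoDB) may stay locked for longer, since the derived query runs each time. I prefer a temporary table that has a primary key joining to the parent table.
  4. Most importantly, it makes your code look neat and readable.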

My approach would be:

    CREATE TEMPORARY TABLE xxx (...);
    INSERT INTO xxx SELECT q FROM y INNER JOIN z....;
    UPDATE _piece_detail INNER JOIN xxx ON (...) SET ...;

Always keep your downtime to a minimum!!
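As a rough sketch of how this approach might look with the question's tables (an illustration under assumptions, not a tested solution): the temporary table name tmp_scan_range and its columns first_scan and latest_scan are made up here, the joins are copied from the question's first query, and the comparison directions follow the question's prose (its WHERE clause uses the opposite operators, so flip them if that is what is intended).

```sql
-- Hypothetical staging table; column types are inherited from the SELECT,
-- and the primary key mirrors the (job_id, piece_id) key of _piece_detail.
CREATE TEMPORARY TABLE tmp_scan_range (
    PRIMARY KEY (job_id, piece_id)
)
SELECT pd.job_id,
       pd.piece_id,
       MIN(s.scan_date_time) AS first_scan,
       MAX(s.scan_date_time) AS latest_scan
FROM _piece_detail AS pd
    INNER JOIN _container_quantity AS cq
        ON pd.cqt_database_id = cq.cqt_database_id
        AND pd.job_id = cq.job_id
    INNER JOIN _container_summary AS cs
        ON cq.container_id = cs.container_id
        AND cs.job_id = cq.job_id
    INNER JOIN _mail_piece_unit AS mpu
        ON cq.mpu_id = mpu.mpu_id
        AND cq.job_id = mpu.job_id
    INNER JOIN _header AS h
        ON h.job_id = pd.job_id
    INNER JOIN mailing_groups AS mg
        ON mpu.mpu_company = mg.mpu_company
    INNER JOIN _scan AS s
        ON s.zip = pd.zip
        AND s.zip_4 = pd.zip_4
        AND s.zip_delivery_point = pd.zip_delivery_point
        AND s.serial_number = pd.serial_number
GROUP BY pd.job_id, pd.piece_id;

-- The UPDATE now joins on the staging table's primary key only.
UPDATE _piece_detail AS pd
    INNER JOIN tmp_scan_range AS t
        ON pd.job_id = t.job_id
        AND pd.piece_id = t.piece_id
SET pd.first_scan_date_time = t.first_scan,
    pd.latest_scan_date_time = t.latest_scan
WHERE pd.first_scan_date_time > t.first_scan
    OR pd.latest_scan_date_time < t.latest_scan;

DROP TEMPORARY TABLE tmp_scan_range;
```

Splitting the work this way keeps the expensive aggregation out of the UPDATE itself, so the rows in _piece_detail should only be locked while the final keyed join runs.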

Answer 2: (score: 0)

Why not use a subquery for each join, including the inner joins?

INNER JOIN (SELECT field1, field2, field3 FROM _container_quantity ORDER BY 1,2,3) AS _container_quantity
    ON _piece_detail.cqt_database_id = _container_quantity.cqt_database_id 
    AND _piece_detail.job_id = _container_quantity.job_id
INNER JOIN (SELECT field1, field2, field3 FROM _container_summary ORDER BY 1,2,3) AS _container_summary
    ON _container_quantity.container_id = _container_summary.container_id 
    AND _container_summary.job_id = _container_quantity.job_id

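By not limiting your selects on those inner joins, you are sure to pull a lot into memory. By using ORDER BY 1,2,3 at the end of each subquery, you create an index on each subquery. Your only index is on headers, and you aren't joining on _headers....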

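A couple of suggestions for optimizing this query: create the needed indexes on each table, or use the subquery join clauses to create the needed indexes manually on the fly.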

Also remember that when you do a left join on a "temp" table full of aggregates, you are just asking for performance problems.

  

"Contains at least one matching _scan record on (zip, zip_4, zip_delivery_point, serial_number)"

Umm... this is the first point of what you want to do, but none of these fields are indexed?
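One way to settle this is to list the indexes directly. For what it's worth, the EXPLAIN output in the question shows the _scan lookup using its PRIMARY key with four ref columns, which suggests those fields are already covered by an index there.

```sql
-- List every index (and the columns in it) on the two tables being matched.
SHOW INDEX FROM _scan;
SHOW INDEX FROM _piece_detail;
```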

Answer 3: (score: 0)

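From your EXPLAIN results, the subqueries appear to go through all the rows twice, so how about keeping the MIN/MAX as in the first query, but using just one left join instead of two?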
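A minimal sketch of that idea, written as a SELECT so it can be run through EXPLAIN first. The alias names s, first_scan and latest_scan are made up for illustration, the mailing_groups join chain from the question is left out for brevity, and an INNER JOIN is used because the question only cares about pieces with at least one scan. Note that the derived table aggregates the whole _scan table, so whether this beats the other variants is something only EXPLAIN and a timed run on the real data can tell.

```sql
EXPLAIN
SELECT pd.job_id, pd.piece_id, s.first_scan, s.latest_scan
FROM _piece_detail AS pd
    INNER JOIN (
        -- One pass over _scan yields both the earliest and the latest scan per key.
        SELECT zip, zip_4, zip_delivery_point, serial_number,
               MIN(scan_date_time) AS first_scan,
               MAX(scan_date_time) AS latest_scan
        FROM _scan
        GROUP BY zip, zip_4, zip_delivery_point, serial_number
    ) AS s
        ON s.zip = pd.zip
        AND s.zip_4 = pd.zip_4
        AND s.zip_delivery_point = pd.zip_delivery_point
        AND s.serial_number = pd.serial_number
-- Comparison directions follow the question's prose description.
WHERE pd.first_scan_date_time > s.first_scan
    OR pd.latest_scan_date_time < s.latest_scan;
```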