Deleting a large amount of data from a table with a primary index

Time: 2014-05-27 11:06:22

Tags: php mysql innodb bigdata clustered-index

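I am trying to delete a large number of rows (> 10 million, roughly 1/3 of all records in the table) from an InnoDB MySQL table that has a primary/clustered index. The field id is the primary/clustered index; it is sequential and has no gaps. At least it should be: I never delete records from the middle. However, some insert queries may fail, and InnoDB may then allocate ids that are never used (I am not sure whether this is true). I only delete old records that are no longer needed. The table contains varchar columns, so rows do not have a fixed size.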

My first attempt:

DELETE FROM `table` WHERE id<=10000000

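This large I/O operation failed. It seems that MySQL killed the query and rolled back all changes. The query ran for approximately 6 hours, and the rollback took about the same. My biggest mistake was leaving the transaction log size at the default 5 MB, so watch out for that. It has to be enlarged.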

The second attempt was to delete in batches of 10,000 records, for example:

DELETE FROM `table` WHERE id<=10000;
COMMIT;
DELETE FROM `table` WHERE id<=20000;
COMMIT;

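and so on. At the start, each query took about 10 seconds (on a laptop). The execution time grew gradually, and after 6 hours of running each query took about 300 seconds.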

The third attempt was to issue queries whose average execution time stays under 1 second. The PHP code:

protected function deleteById($table, $id) {
    $MinId          = $this->getMinFromTable($table, 'id');
    $PackDeleteCount= $this->PackDeleteCount;
    $timerTotal     = new Timer();
    $delCountTotal  = 0;
    $delCountReport = 0;
    $delInfo        = array();
    $PackMinTime    = round($this->PackDeleteTime - $this->PackDeleteTime*$this->PackDeleteDiv, 3);
    $PackMaxTime    = round($this->PackDeleteTime + $this->PackDeleteTime*$this->PackDeleteDiv, 3);
    $this->LogString(sprintf('Del `%s`, PackMinTime: %s; PackMaxTime: %s', $table, $PackMinTime, $PackMaxTime));
    for (; $MinId < $id;) {
        $MinId          += $PackDeleteCount;
        $delCountReport += $PackDeleteCount;
        if ($MinId > $id) {
            $MinId = $id;
        }
        $timer          = new Timer();
        $sql            = sprintf('DELETE FROM `%s` WHERE id<=%s', $table, $MinId);
        $this->s->Query($sql, __FILE__, __LINE__);
        $delCount       = $this->s->AffectedRows();
        $this->s->CommitT();
        $RoundTime      = round($timer->end(), 3);
        $delInfo[]      = array(
            'time'  => $RoundTime,
            'rows'  => $PackDeleteCount,
        );
        $delCountTotal  += $delCount;
        if ($delCountReport >= $this->PackDeleteReport) {
            $delCountReport = 0;
            $delSqlCount    = count($delInfo);
            $EvTime         = 0;
            $PackTime       = 0;
            $EvCount        = 0;
            $PackCount      = 0;
            foreach ($delInfo as $v) {
                $PackTime   += $v['time'];
                $PackCount  += $v['rows'];
            }
            $EvTime         = round($PackTime/$delSqlCount, 2);
            $PackTime       = round($PackTime, 2);
            $EvCount        = round($PackCount/$delSqlCount);
            $TotalTime      = $this->readableTime(intval($timerTotal->end()));
            $this->LogString(sprintf('Del `%s`, Sql query count: %d; Time: %s; Count: %d; Average Time %s; Average count per delete: %d; Del total: %s; Del Total Time: %s; id <= %s', $table, $delSqlCount, $PackTime, $PackCount, $EvTime, $EvCount, $delCountTotal, $TotalTime, $MinId));
            $delInfo        = array();
        }

        $PackDeleteCountOld = $PackDeleteCount;
        if ($RoundTime < $PackMinTime) {
            $PackDeleteCount    = intval($PackDeleteCount + $PackDeleteCount*(1 - $RoundTime/$this->PackDeleteTime));
        } elseif ($RoundTime > $PackMaxTime) {
            $PackDeleteCount    = intval($PackDeleteCount - $PackDeleteCount*(1 - $this->PackDeleteTime/$RoundTime));
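        }
        // Keep the batch size at 1 or more; at 0, $MinId would stop advancing and the loop would never end.
        $PackDeleteCount    = max(1, $PackDeleteCount);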
        //$this->LogString(sprintf('Del `%s`, round time: %s; row count old: %d; row count new: %d', $table, $RoundTime, $PackDeleteCountOld, $PackDeleteCount));
    }
    $this->LogString(sprintf('Finished del `%s`: time: %s', $table, round($timerTotal->end(), 2)));
}

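It has some dependencies, but they are self-explanatory and can easily be swapped for standard ones. I will only explain the input variables used here: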

$table - the target table from which rows need to be deleted
$id - all records up to and including this id should be deleted
$MinId - the minimal id in the target table
$this->PackDeleteCount - the initial number of records per query. The row count to delete is then recalculated for each new query.
$this->PackDeleteTime - the desired average query execution time, in seconds. I used 0.5
$this->PackDeleteDiv - the acceptable deviation from $this->PackDeleteTime, as a fraction of it. I used 0.3 (that is, 30%)
$this->PackDeleteReport - statistics about the deletion are printed every N records

This variant has stable performance.
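The batch-size adjustment is the core of the third attempt, so here it is isolated as a standalone function. This is a minimal sketch, not part of the original class: the name nextBatchSize and its parameters are hypothetical, and the lower bound of 1 is an added safeguard that the original loop does not have.

```php
<?php
/**
 * Hypothetical helper: computes the next batch size from the previous
 * batch's timing, using the same rule as deleteById() above.
 *
 * @param int   $count     rows requested in the previous batch
 * @param float $elapsed   seconds the previous batch took
 * @param float $target    desired seconds per batch (PackDeleteTime)
 * @param float $deviation tolerated fraction around $target (PackDeleteDiv)
 */
function nextBatchSize(int $count, float $elapsed, float $target, float $deviation): int
{
    $minTime = $target - $target * $deviation;
    $maxTime = $target + $target * $deviation;

    if ($elapsed < $minTime) {
        // Faster than the tolerated window: grow the batch.
        $count = (int) ($count + $count * (1 - $elapsed / $target));
    } elseif ($elapsed > $maxTime) {
        // Slower than the tolerated window: shrink the batch.
        $count = (int) ($count - $count * (1 - $target / $elapsed));
    }

    // Assumption added for this sketch: never return less than 1 row.
    return max(1, $count);
}
```

With a target of 0.5 s and a deviation of 0.3, a batch of 1000 rows that finished in 0.25 s grows to 1500 rows, one that took 1.0 s shrinks to 500 rows, and one that took exactly 0.5 s is left unchanged.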

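I believe the reason for the poor performance is that the database engine has to physically re-sort all record data in the affected leaves. That is my understanding; if your knowledge goes deeper, you are welcome to explain in detail what actually happens. Maybe it will lead to some new ideas.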

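Question: is it possible to calculate the distribution of rows over the leaves and delete whole leaves or even branches, so that the database engine does not have to re-sort the data? You may also have other ideas for optimizing performance in this situation.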

2 answers:

Answer 0: (score: 0)

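I have faced this several times. I usually handle it by creating a partition (or a few of them first), because this reduces the I/O that InnoDB needs for large delete queries and avoids rebuilding the whole index tree. Then I delete in chunks of between 1000 and 1500 rows at a time.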

It is also good practice to:

  • Set autocommit to 1
  • Delete about 1,500 rows at a time
  • Make sure innodb_log_file_size is large enough
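The answer does not show what the partitioning step looks like, so the following is only one possible sketch. It assumes RANGE partitioning on id, with a single partition holding the rows to purge; the function name buildPartitionPurgeSql and the partition names p_old and p_keep are hypothetical. The function only builds the SQL strings and does not talk to the database.

```php
<?php
/**
 * Hypothetical helper: builds the two MySQL statements for a
 * partition-based purge. $table must be a trusted identifier,
 * because it is interpolated into the SQL without escaping.
 *
 * @return string[] [0] repartitions the table, [1] drops the old rows
 */
function buildPartitionPurgeSql(string $table, int $maxIdToDrop): array
{
    // VALUES LESS THAN is an exclusive upper bound, hence the +1.
    $boundary = $maxIdToDrop + 1;

    return [
        sprintf(
            'ALTER TABLE `%s` PARTITION BY RANGE (id) ('
            . 'PARTITION p_old VALUES LESS THAN (%d), '
            . 'PARTITION p_keep VALUES LESS THAN MAXVALUE)',
            $table,
            $boundary
        ),
        sprintf('ALTER TABLE `%s` DROP PARTITION p_old', $table),
    ];
}
```

Dropping a partition removes its rows without deleting them one by one, which is the closest match to "deleting whole branches" from the question. Note that repartitioning an existing table rewrites it, so this tends to pay off mainly when the partitions are set up before the data grows, and that partitioned InnoDB tables do not support foreign keys.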

Answer 1: (score: 0)

Try

DELETE FROM `table` WHERE id BETWEEN 1 AND 10000000
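The answer gives a single statement covering the whole range. One way to combine it with the batching from the question is to bound every batch on both sides, so that each DELETE names only the ids it is meant to remove. This is a sketch under that assumption; the function name batchedDeleteStatements is hypothetical, it only generates SQL strings, and whether it actually avoids the slowdown seen in the second attempt would need to be measured.

```php
<?php
/**
 * Hypothetical helper: yields one DELETE statement per id range.
 * $table must be a trusted identifier, because it is interpolated
 * into the SQL without escaping.
 */
function batchedDeleteStatements(string $table, int $minId, int $maxId, int $batchSize): Generator
{
    if ($batchSize < 1) {
        throw new InvalidArgumentException('batchSize must be >= 1');
    }

    for ($from = $minId; $from <= $maxId; $from += $batchSize) {
        // BETWEEN is inclusive on both ends; clamp the last batch to $maxId.
        $to = min($from + $batchSize - 1, $maxId);
        yield sprintf('DELETE FROM `%s` WHERE id BETWEEN %d AND %d', $table, $from, $to);
    }
}
```

Each yielded statement would be executed and committed on its own, as in the loop from the question.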