Question

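I have two tables named table_1 (1 GB) and reference (250 MB).

When I run a cross join against reference, updating table_1 takes 16 hours. We changed the filesystem from EXT3 to XFS, but it still takes 16 hours. What am I doing wrong?

Here is the update/cross join query: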

  mysql> UPDATE table_1 CROSS JOIN reference ON
  -> (table_1.start >= reference.txStart AND table_1.end <= reference.txEnd)
  -> SET table_1.name = reference.name;
  Query OK, 17311434 rows affected (16 hours 36 min 48.62 sec)
  Rows matched: 17311434  Changed: 17311434  Warnings: 0

Here is the SHOW CREATE TABLE output for table_1 and reference:

    CREATE TABLE `table_1` (
     `strand` char(1) DEFAULT NULL,
     `chr` varchar(10) DEFAULT NULL,
     `start` int(11) DEFAULT NULL,
     `end` int(11) DEFAULT NULL,
     `name` varchar(255) DEFAULT NULL,
     `name2` varchar(255) DEFAULT NULL,
     KEY `annot` (`start`,`end`)
   ) ENGINE=MyISAM DEFAULT CHARSET=latin1 ;


   CREATE TABLE `reference` (
     `bin` smallint(5) unsigned NOT NULL,
     `name` varchar(255) NOT NULL,
     `chrom` varchar(255) NOT NULL,
     `strand` char(1) NOT NULL,
     `txStart` int(10) unsigned NOT NULL,
     `txEnd` int(10) unsigned NOT NULL,
     `cdsStart` int(10) unsigned NOT NULL,
     `cdsEnd` int(10) unsigned NOT NULL,
     `exonCount` int(10) unsigned NOT NULL,
     `exonStarts` longblob NOT NULL,
     `exonEnds` longblob NOT NULL,
     `score` int(11) DEFAULT NULL,
     `name2` varchar(255) NOT NULL,
     `cdsStartStat` enum('none','unk','incmpl','cmpl') NOT NULL,
     `cdsEndStat` enum('none','unk','incmpl','cmpl') NOT NULL,
     `exonFrames` longblob NOT NULL,
      KEY `chrom` (`chrom`,`bin`),
      KEY `name` (`name`),
      KEY `name2` (`name2`),
      KEY `annot` (`txStart`,`txEnd`)
   ) ENGINE=MyISAM DEFAULT CHARSET=latin1 ;

Answer 1

You should index the columns table_1.start, reference.txStart, table_1.end and reference.txEnd:

ALTER TABLE `table_1` ADD INDEX ( `start` ) ;
ALTER TABLE `table_1` ADD INDEX ( `end` ) ;
ALTER TABLE `reference` ADD INDEX ( `txStart` ) ;
ALTER TABLE `reference` ADD INDEX ( `txEnd` ) ;

Answer 2

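A cross join is a Cartesian product, which is probably one of the most computationally expensive operations there is (it does not scale well).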

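For tables T_i, i = 1 to n, the number of rows produced by cross joining T_1 through T_n is the product of the table sizes, i.e.

| T_1 | * | T_2 | * ... * | T_n |

Assuming every table has M rows, the cost of computing the cross join is

M * M * ... * M (n factors) = O(M^n)

which is exponential in the number of tables involved in the join.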
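To make the growth concrete, here is a small Python sketch (not MySQL). The tiny tables and the helper cross_join_size are made up for illustration. It counts the pairs a naive nested-loop evaluation would examine and how few of them survive the range condition from the question.

```python
from itertools import product

def cross_join_size(*table_sizes):
    """Number of rows in the Cartesian product of tables with these sizes."""
    total = 1
    for size in table_sizes:
        total *= size
    return total

# Hypothetical tiny tables standing in for table_1 and reference.
table_1 = [(5, 8), (12, 30), (40, 41)]                 # (start, end)
reference = [(0, 10), (10, 35), (50, 60), (0, 100)]    # (txStart, txEnd)

# Every combination of one table_1 row with one reference row.
pairs = list(product(table_1, reference))

# Only the pairs that satisfy the join condition from the question.
matching = [
    (t, r) for t, r in pairs
    if t[0] >= r[0] and t[1] <= r[1]
]

print(len(pairs), len(matching))
print(cross_join_size(1000, 1000, 1000))  # three tables of 1000 rows each
```

With 3 and 4 rows the product is only 12 pairs, but the same arithmetic applied to tables with millions of rows explains why an unindexed evaluation gets out of hand.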

Answer 3

Try this:

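UPDATE table_1 SET
table_1.name = (
  select reference.name
  from reference
  where table_1.start >= reference.txStart
  and table_1.end <= reference.txEnd
  -- several reference rows can match one table_1 row; without LIMIT 1
  -- MySQL rejects a scalar subquery that returns more than one row
  limit 1);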
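One thing worth checking before running this form: a correlated subquery that finds no row yields NULL, so rows of table_1 without a matching reference row get their name overwritten with NULL, whereas the JOIN form in the question leaves them untouched. The sketch below shows this with Python's built-in sqlite3 module as a stand-in for MySQL; the sample rows are invented for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_1 (`start` INTEGER, `end` INTEGER, name TEXT);
    CREATE TABLE reference (txStart INTEGER, txEnd INTEGER, name TEXT);
    -- made-up rows: the first one fits inside geneA, the second fits nowhere
    INSERT INTO table_1 VALUES (5, 8, 'old1'), (200, 210, 'old2');
    INSERT INTO reference VALUES (0, 10, 'geneA');
""")

# Same shape as the UPDATE suggested in this answer.
conn.execute("""
    UPDATE table_1 SET name = (
        SELECT reference.name
        FROM reference
        WHERE table_1.`start` >= reference.txStart
          AND table_1.`end` <= reference.txEnd)
""")

names = [row[0] for row in
         conn.execute("SELECT name FROM table_1 ORDER BY `start`")]
print(names)
```

If the old names of unmatched rows must be kept, adding a WHERE EXISTS (...) clause with the same condition to the UPDATE is one way to restrict it to rows that have a match.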

Answer 4

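I see 2 problems with the UPDATE statement.

First, there is no index on the end columns. The composite index you have (annot) will only be used for the start column in this query. You should add the indexes as Emre suggested: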

ALTER TABLE `table_1` ADD INDEX ( `end` ) ;
ALTER TABLE `reference` ADD INDEX ( `txEnd` ) ;
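Why the composite index only helps with start can be sketched with a sorted list in Python. This is a simplified model of an index ordered by (start, end), not MySQL's actual B-tree code, and the sample entries and the function range_lookup are invented for the example.

```python
import bisect

# Simplified stand-in for the composite index annot(start, end):
# entries are kept sorted by start first, then by end.
index = sorted([(10, 50), (20, 25), (20, 90), (30, 35), (40, 45)])

def range_lookup(index, tx_start, tx_end):
    """Find entries with start >= tx_start and end <= tx_end."""
    # Binary search works for the leading column only.
    first = bisect.bisect_left(index, (tx_start,))
    candidates = index[first:]
    # The end values are not sorted across different starts,
    # so every remaining candidate has to be checked one by one.
    matches = [(s, e) for s, e in candidates if e <= tx_end]
    return len(candidates), matches

scanned, matches = range_lookup(index, 20, 40)
print(scanned, matches)
```

The lookup jumps straight to the first entry with start >= 20, but then has to walk all four remaining entries to find the two whose end is small enough.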

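Second, the JOIN may (and probably does) find many reference rows that match a single table_1 row. As a result, some (or all) of the updated table_1 rows are updated more than once. Check the result of this query to see whether it equals the number of updated rows (17311434):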

SELECT COUNT(*)
FROM table_1
  WHERE EXISTS
    ( SELECT *
      FROM reference
      WHERE table_1.start >= reference.txStart
        AND table_1.`end` <= reference.txEnd
    );
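The difference between "join rows" and "distinct table_1 rows that have a match" can be seen on a toy data set. This sketch uses Python's built-in sqlite3 module as a stand-in for MySQL, and all sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_1 (`start` INTEGER, `end` INTEGER, name TEXT);
    CREATE TABLE reference (txStart INTEGER, txEnd INTEGER, name TEXT);
    INSERT INTO table_1 VALUES (5, 8, NULL), (12, 30, NULL), (200, 210, NULL);
    INSERT INTO reference VALUES
        (0, 10, 'geneA'), (0, 100, 'geneB'), (10, 35, 'geneC');
""")

# Number of (table_1 row, reference row) pairs the join produces.
join_rows = conn.execute("""
    SELECT COUNT(*)
    FROM table_1 JOIN reference
      ON table_1.`start` >= reference.txStart
     AND table_1.`end` <= reference.txEnd
""").fetchone()[0]

# Number of table_1 rows that have at least one matching reference row.
distinct_rows = conn.execute("""
    SELECT COUNT(*)
    FROM table_1
    WHERE EXISTS (
        SELECT * FROM reference
        WHERE table_1.`start` >= reference.txStart
          AND table_1.`end` <= reference.txEnd)
""").fetchone()[0]

print(join_rows, distinct_rows)
```

Here two table_1 rows each fall inside two reference intervals, so the join yields 4 pairs for only 2 distinct rows, and which name ends up stored for each of them is not determined by the query.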

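There may be other ways to write this query, but the lack of a PRIMARY KEY on both tables makes it harder. If you define a primary key on table_1, try the query below, replacing id with your primary key.

Update: No, do not try it on a table with 34M rows. Check the execution plan and try it on smaller tables first.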

UPDATE table_1 AS t1
  JOIN 
    ( SELECT t2.id
           , r.name
      FROM table_1 AS t2
        JOIN
          ( SELECT name, txStart, txEnd
            FROM reference
            GROUP BY txStart, txEnd
          ) AS r
          ON  t2.start >= r.txStart
          AND t2.`end` <= r.txEnd
      GROUP BY t2.id
    ) AS good
    ON good.id = t1.id
SET t1.name = good.name;

You can check the query plan by running EXPLAIN on the equivalent SELECT:

EXPLAIN
SELECT t1.id, t1.name, good.name
FROM table_1 AS t1
  JOIN 
    ( SELECT t2.id
           , r.name
      FROM table_1 AS t2
        JOIN
          ( SELECT name, txStart, txEnd
            FROM reference
            GROUP BY txStart, txEnd
          ) AS r
          ON  t2.start >= r.txStart
          AND t2.`end` <= r.txEnd
      GROUP BY t2.id
    ) AS good
    ON good.id = t1.id ;

Answer 5

Others have already suggested adding some indexes, but I think you can get the best performance with these two indexes:

ALTER TABLE `reference`
    ADD INDEX `reference_start_end` (`txStart` ASC, `txEnd` ASC);
ALTER TABLE `table_1`
    ADD INDEX `table_1_start_end` (`start` ASC, `end` ASC);

The query will use only one of them, but MySQL will decide automatically which one is more useful.

Super slow query with CROSS JOIN
