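After reading a lot of threads on how to find records in one table that do not exist in another (SQL - find records from one table which don't exist in another), I am trying to decide which query is the most efficient to use.
Unfortunately, the two tables I am working with have no unique key, are 30+ columns wide, and are between 2 and 30 million rows long (typically 2 to 6 million). For a row to count as matched, every value in the row must match.
So far, after countless hours of searching, I have found that there are basically the following 4 forms of query. The first 3 queries below, and the tables, are quoted from the linked thread and its accepted answer: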
Phone_book
+----+------+--------------+
| id | name | phone_number |
+----+------+--------------+
| 1 | John | 111111111111 |
+----+------+--------------+
| 2 | Jane | 222222222222 |
+----+------+--------------+
Call
+----+------+--------------+
| id | date | phone_number |
+----+------+--------------+
| 1 | 0945 | 111111111111 |
+----+------+--------------+
| 2 | 0950 | 222222222222 |
+----+------+--------------+
| 3 | 1045 | 333333333333 |
+----+------+--------------+
Query 1
SELECT *
FROM Call
WHERE phone_number NOT IN (SELECT phone_number FROM Phone_book)
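One caveat with Query 1 that matters for nullable columns: if the subquery returns even one NULL, `NOT IN` evaluates to NULL rather than TRUE for every non-matching row, so the query returns nothing. A minimal sketch of a guard, assuming Phone_book.phone_number can be NULL (the quoted thread does not say either way):

```sql
-- Assumption: Phone_book.phone_number is nullable.
-- Filtering NULLs out of the subquery lets NOT IN evaluate to TRUE
-- for phone numbers that are genuinely missing from Phone_book.
-- Backticks around Call because CALL is a reserved word in MySQL.
SELECT *
FROM `Call`
WHERE phone_number NOT IN (
    SELECT phone_number
    FROM Phone_book
    WHERE phone_number IS NOT NULL
);
```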
Query 2
SELECT *
FROM Call
WHERE NOT EXISTS
(SELECT *
FROM Phone_book
WHERE Phone_book.phone_number = Call.phone_number)
Query 3
SELECT *
FROM Call
LEFT OUTER JOIN Phone_Book
ON (Call.phone_number = Phone_book.phone_number)
WHERE Phone_book.phone_number IS NULL
In addition to these 3 queries, I also tried this one:
Query 4
SELECT a.* FROM CALL a NATURAL LEFT JOIN Phone_book b WHERE b.id IS NULL
Query 4 was unacceptably slow and is included here for reference only. I am still testing the others, but that will take some time.
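For context on what Query 4 actually compares: `NATURAL LEFT JOIN` joins on every column name the two tables share. For the sample tables those are id and phone_number, so it is not matching on phone_number alone. A sketch of the explicit equivalent for the sample tables:

```sql
-- Equivalent to Query 4 for the sample tables: the shared column names
-- are id and phone_number, so both must be equal for a row to match.
-- Backticks around Call because CALL is a reserved word in MySQL.
SELECT a.*
FROM `Call` a
LEFT JOIN Phone_book b USING (id, phone_number)
WHERE b.id IS NULL;
```

On tables where every column name is shared, the natural join compares all columns with ordinary `=`, under which NULL never equals NULL, so rows containing a NULL are always reported as unmatched.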
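In my particular application, I am also inserting the results (the rows found in Call but not in Phone_book) into a table of their own. This leads to three questions:
Which query is fastest?
Which query is fastest when forced to compare all of the column values in a row rather than just phone_number?
Which query is fastest when combined with an INSERT statement, without taking up a large amount of result buffer space?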
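EDIT: As requested, here is some information about my specific setup. I started from scratch and dropped the indexes I had previously put in place. To test the queries above, I indexed F0 in both OldImportResults and NewImportResults.
Table creation statement for OldImportResults: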
CREATE TABLE `OldImportResults` (
`F0` varchar(9) DEFAULT NULL,
`F1` varchar(1) DEFAULT NULL,
`F2` varchar(3) DEFAULT NULL,
`F3` varchar(1) DEFAULT NULL,
`F4` bigint(11) DEFAULT NULL,
`F5` varchar(3) DEFAULT NULL,
`F6` varchar(3) DEFAULT NULL,
`F7` varchar(118) DEFAULT NULL,
`F8` varchar(30) DEFAULT NULL,
`F9` varchar(2) DEFAULT NULL,
`F10` varchar(9) DEFAULT NULL,
`F11` varchar(38) DEFAULT NULL,
`F12` varchar(38) DEFAULT NULL,
`F13` varchar(8) DEFAULT NULL,
`F14` int(8) DEFAULT NULL,
`F15` varchar(9) DEFAULT NULL,
`F16` varchar(25) DEFAULT NULL,
`F17` varchar(8) DEFAULT NULL,
`F18` varchar(1) DEFAULT NULL,
`F19` varchar(100) DEFAULT NULL,
`F20` bigint(19) DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
Table creation statement for NewImportResults:
CREATE TABLE `NewImportResults` (
`F0` varchar(11) DEFAULT NULL,
`F1` varchar(1) DEFAULT NULL,
`F2` varchar(3) DEFAULT NULL,
`F3` varchar(1) DEFAULT NULL,
`F4` bigint(11) DEFAULT NULL,
`F5` varchar(3) DEFAULT NULL,
`F6` varchar(3) DEFAULT NULL,
`F7` varchar(110) DEFAULT NULL,
`F8` varchar(30) DEFAULT NULL,
`F9` varchar(2) DEFAULT NULL,
`F10` varchar(9) DEFAULT NULL,
`F11` varchar(33) DEFAULT NULL,
`F12` varchar(34) DEFAULT NULL,
`F13` varchar(8) DEFAULT NULL,
`F14` int(8) DEFAULT NULL,
`F15` varchar(9) DEFAULT NULL,
`F16` varchar(25) DEFAULT NULL,
`F17` varchar(8) DEFAULT NULL,
`F18` varchar(1) DEFAULT NULL,
`F19` varchar(100) DEFAULT NULL,
`F20` bigint(19) DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
Table creation statement for HistoricImportResults:
CREATE TABLE `HistoricImportResults` (
`F0` varchar(9) DEFAULT NULL,
`F1` varchar(1) DEFAULT NULL,
`F2` varchar(3) DEFAULT NULL,
`F3` varchar(1) DEFAULT NULL,
`F4` bigint(11) DEFAULT NULL,
`F5` varchar(3) DEFAULT NULL,
`F6` varchar(3) DEFAULT NULL,
`F7` varchar(118) DEFAULT NULL,
`F8` varchar(30) DEFAULT NULL,
`F9` varchar(2) DEFAULT NULL,
`F10` varchar(9) DEFAULT NULL,
`F11` varchar(38) DEFAULT NULL,
`F12` varchar(38) DEFAULT NULL,
`F13` varchar(8) DEFAULT NULL,
`F14` int(8) DEFAULT NULL,
`F15` varchar(9) DEFAULT NULL,
`F16` varchar(25) DEFAULT NULL,
`F17` varchar(8) DEFAULT NULL,
`F18` varchar(1) DEFAULT NULL,
`F19` varchar(100) DEFAULT NULL,
`F20` bigint(19) DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
I want to take all rows that are in OldImportResults but not in NewImportResults and put them into HistoricImportResults.
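To make the goal concrete, here is a sketch of the full-row version of the NOT EXISTS form combined with the insert. It assumes that two NULLs in the same column should count as a match, which is why it uses MySQL's NULL-safe equality operator `<=>` instead of `=`; the aliases o and n are only for illustration. It is not a tested or benchmarked solution.

```sql
-- Sketch only: copy rows of OldImportResults that have no fully identical
-- row in NewImportResults into HistoricImportResults.
-- <=> is MySQL's NULL-safe equality, so NULL <=> NULL is true.
-- SELECT o.* lines up with HistoricImportResults because both tables
-- declare F0..F20 in the same order.
INSERT INTO HistoricImportResults
SELECT o.*
FROM OldImportResults AS o
WHERE NOT EXISTS (
    SELECT 1
    FROM NewImportResults AS n
    WHERE n.F0  <=> o.F0
      AND n.F1  <=> o.F1
      AND n.F2  <=> o.F2
      AND n.F3  <=> o.F3
      AND n.F4  <=> o.F4
      AND n.F5  <=> o.F5
      AND n.F6  <=> o.F6
      AND n.F7  <=> o.F7
      AND n.F8  <=> o.F8
      AND n.F9  <=> o.F9
      AND n.F10 <=> o.F10
      AND n.F11 <=> o.F11
      AND n.F12 <=> o.F12
      AND n.F13 <=> o.F13
      AND n.F14 <=> o.F14
      AND n.F15 <=> o.F15
      AND n.F16 <=> o.F16
      AND n.F17 <=> o.F17
      AND n.F18 <=> o.F18
      AND n.F19 <=> o.F19
      AND n.F20 <=> o.F20
);
```

Because the rows go straight from the SELECT into the target table, the result set is never returned to the client.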
Here are some EXPLAINs and their results:
EXPLAIN
SELECT * FROM OldImportResults WHERE NOT EXISTS (
SELECT * FROM NewImportResults
WHERE OldImportResults.F0 = NewImportResults.F0
);
+----+--------------------+---------------------------------+------+---------------+------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+---------------------------------+------+---------------+------+---------+------+---------+-------------+
| 1 | PRIMARY | OldImportResults | ALL | NULL | NULL | NULL | NULL | 2074378 | Using where |
| 2 | DEPENDENT SUBQUERY | NewImportResults | ALL | NULL | NULL | NULL | NULL | 2074378 | Using where |
+----+--------------------+---------------------------------+------+---------------+------+---------+------+---------+-------------+
EXPLAIN
SELECT *
FROM OldImportResults
LEFT OUTER JOIN NewImportResults
ON (OldImportResults.F0 = NewImportResults.F0)
WHERE NewImportResults.F0 IS NULL;
+----+-------------+---------------------------------+------+---------------+------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------------------------+------+---------------+------+---------+------+---------+-------------+
| 1 | SIMPLE | OldImportResults | ALL | NULL | NULL | NULL | NULL | 2074378 | |
| 1 | SIMPLE | NewImportResults | ALL | NULL | NULL | NULL | NULL | 2074378 | Using where |
+----+-------------+---------------------------------+------+---------------+------+---------+------+---------+-------------+
EXPLAIN
SELECT a.*
FROM OldImportResults a
NATURAL LEFT JOIN NewImportResults b
WHERE b.F0 IS NULL;
+----+-------------+-------+------+---------------+------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+---------+-------------+
| 1 | SIMPLE | a | ALL | NULL | NULL | NULL | NULL | 2074378 | |
| 1 | SIMPLE | b | ALL | NULL | NULL | NULL | NULL | 2074378 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+---------+-------------+
EDIT:
I decided to index F0 on each table, as shown below.
SHOW INDEX FROM OldImportResults;
+----------------------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+----------------------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| OldImportResults | 1 | F0 | 1 | F0 | A | 6627 | NULL | NULL | YES | BTREE | |
+----------------------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
SHOW INDEX FROM NewImportResults;
+---------------------------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+---------------------------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| NewImportResults | 1 | F0 | 1 | F0 | A | 6627 | NULL | NULL | YES | BTREE | |
+---------------------------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
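The post does not show how these indexes were created. Statements along the following lines would produce the non-unique BTREE indexes listed above; they are an assumption, not the commands that were actually run:

```sql
-- Assumed statements; the index name F0 matches Key_name in the
-- SHOW INDEX output above.
ALTER TABLE OldImportResults ADD INDEX F0 (F0);
ALTER TABLE NewImportResults ADD INDEX F0 (F0);
```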
Running EXPLAIN on each query, I got this:
EXPLAIN
SELECT * FROM OldImportResults WHERE NOT EXISTS (
SELECT * FROM NewImportResults
WHERE OldImportResults.F0 = NewImportResults.F0
);
+----+--------------------+---------------------------------+------+---------------+------+---------+------------------------------------------------+---------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+---------------------------------+------+---------------+------+---------+------------------------------------------------+---------+--------------------------+
| 1 | PRIMARY | OldImportResults | ALL | NULL | NULL | NULL | NULL | 2074378 | Using where |
| 2 | DEPENDENT SUBQUERY | NewImportResults | ref | F0 | F0 | 12 | DBName.OldImportResults.F0 | 313 | Using where; Using index |
+----+--------------------+---------------------------------+------+---------------+------+---------+------------------------------------------------+---------+--------------------------+
EXPLAIN
SELECT *
FROM OldImportResults
LEFT OUTER JOIN NewImportResults
ON (OldImportResults.F0 = NewImportResults.F0)
WHERE NewImportResults.F0 IS NULL;
+----+-------------+---------------------------------+------+---------------+------+---------+------------------------------------------------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------------------------+------+---------------+------+---------+------------------------------------------------+---------+-------------+
| 1 | SIMPLE | OldImportResults | ALL | NULL | NULL | NULL | NULL | 2074378 | |
| 1 | SIMPLE | NewImportResults | ref | F0 | F0 | 12 | DBName.OldImportResults.F0 | 313 | Using where |
+----+-------------+---------------------------------+------+---------------+------+---------+------------------------------------------------+---------+-------------+
EXPLAIN
SELECT a.*
FROM OldImportResults a
NATURAL LEFT JOIN NewImportResults b
WHERE b.F0 IS NULL;
+----+-------------+-------+------+---------------+------+---------+-----------------------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+-----------------------+---------+-------------+
| 1 | SIMPLE | a | ALL | NULL | NULL | NULL | NULL | 2074378 | |
| 1 | SIMPLE | b | ref | F0 | F0 | 12 | DBName.a.F0 | 313 | Using where |
+----+-------------+-------+------+---------------+------+---------+-----------------------+---------+-------------+
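Based on the EXPLAIN results, it appears that the WHERE NOT EXISTS statement executes much faster when using the index. It takes 41 seconds to execute, which is not terrible for me. The others were not worth testing. Does anyone think they can improve the execution time in any way?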
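One direction that might be worth trying, offered as an untested idea: the EXPLAIN output estimates about 313 rows per F0 value, so each lookup into NewImportResults still has many candidates once more columns are compared. A composite index that leads with F0 and adds a few more columns could narrow each lookup. The index name and the choice of F4 and F20 are assumptions for illustration; nothing in the post says which columns are selective.

```sql
-- Hypothetical composite index; idx_f0_f4_f20 and the column choice
-- (F0, F4, F20) are illustrative assumptions, not measured recommendations.
ALTER TABLE NewImportResults ADD INDEX idx_f0_f4_f20 (F0, F4, F20);

-- Re-check the plan for a query that compares the indexed columns.
EXPLAIN
SELECT *
FROM OldImportResults
WHERE NOT EXISTS (
    SELECT 1
    FROM NewImportResults
    WHERE NewImportResults.F0  <=> OldImportResults.F0
      AND NewImportResults.F4  <=> OldImportResults.F4
      AND NewImportResults.F20 <=> OldImportResults.F20
);
```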