I have a table with the following columns:
record_id
source_id
user_id
mobile
called_at
I am trying to run these two queries:
SELECT
    t1.user_id,
    t1.mobile,
    COUNT(DISTINCT t1.called_at) AS cnt
FROM
    (
        SELECT
            user_id,
            mobile,
            called_at
        FROM
            users
        WHERE
            called_at >= "2016-09-01" AND called_at < "2016-12-01" AND user_id IS NOT NULL
    ) t1
GROUP BY t1.user_id, t1.mobile
HAVING cnt > 1
and
SELECT
    user_id,
    mobile,
    COUNT(DISTINCT called_at) AS cnt
FROM users
WHERE called_at >= "2016-09-01" AND called_at < "2016-12-01" AND user_id IS NOT NULL
GROUP BY user_id, mobile
HAVING cnt > 1
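The two queries are logically identical and return the same output. Yet the first query runs very fast (~3 seconds), while the second takes ~55 seconds.
Even EXPLAIN says the first query involves an extra scan over the derived table with a filesort, and it is still much faster.
How is this possible?
EXPLAIN output: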
+----+-------------+-----------------------+------+-----------------------+------+---------+------+---------+----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------------------+------+-----------------------+------+---------+------+---------+----------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 1025150 | Using filesort |
| 2 | DERIVED | users | ALL | idx_fa_af,idx_a_di_um | NULL | NULL | NULL | 2221923 | Using where |
+----+-------------+-----------------------+------+-----------------------+------+---------+------+---------+----------------+
+----+-------------+-----------------------+-------+-----------------------+-------------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------------------+-------+-----------------------+-------------+---------+------+---------+-------------+
| 1 | SIMPLE | users | index | idx_fa_af,idx_a_di_um | idx_a_di_um | 23 | NULL | 2221923 | Using where |
+----+-------------+-----------------------+-------+-----------------------+-------------+---------+------+---------+-------------+
| users | CREATE TABLE `users` (
`record_id` varchar(100) NOT NULL,
`source_id` int(11) NOT NULL,
`user_id` int(11) DEFAULT NULL,
`mobile` varchar(15) DEFAULT NULL,
`updated_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`called_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
UNIQUE KEY `idx_unique_a_ri_si` (`record_id`,`source_id`),
KEY `idx_fa_af` (`called_at`),
KEY `idx_fa_um` (`mobile`),
KEY `idx_a_di_um` (`user_id`,`mobile`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
+----------------------------+---------+
| Variable_name | Value |
+----------------------------+---------+
| Handler_commit | 1 |
| Handler_delete | 0 |
| Handler_discover | 0 |
| Handler_external_lock | 2 |
| Handler_mrr_init | 0 |
| Handler_prepare | 0 |
| Handler_read_first | 1 |
| Handler_read_key | 1 |
| Handler_read_last | 0 |
| Handler_read_next | 0 |
| Handler_read_prev | 0 |
| Handler_read_rnd | 0 |
| Handler_read_rnd_next | 3676447 |
| Handler_rollback | 0 |
| Handler_savepoint | 0 |
| Handler_savepoint_rollback | 0 |
| Handler_update | 0 |
| Handler_write | 1208173 |
+----------------------------+---------+
+----------------------------+---------+
| Variable_name | Value |
+----------------------------+---------+
| Handler_commit | 1 |
| Handler_delete | 0 |
| Handler_discover | 0 |
| Handler_external_lock | 2 |
| Handler_mrr_init | 0 |
| Handler_prepare | 0 |
| Handler_read_first | 1 |
| Handler_read_key | 1 |
| Handler_read_last | 0 |
| Handler_read_next | 2468272 |
| Handler_read_prev | 0 |
| Handler_read_rnd | 0 |
| Handler_read_rnd_next | 0 |
| Handler_rollback | 0 |
| Handler_savepoint | 0 |
| Handler_savepoint_rollback | 0 |
| Handler_update | 0 |
| Handler_write | 0 |
+----------------------------+---------+
Answer 0 (score: 1)
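Add INDEX(user_id, called_at, mobile), then run each query twice. Running them twice keeps caching from hiding the I/O cost.
I suspect the first query ran fast because everything it needed was already in RAM, while the second used the index idx_a_di_um, which was not cached.
The index I suggest should make both of them run faster.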
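As a minimal sketch, the suggested index could be added like this. The index name idx_user_called_mobile is a hypothetical choice made for illustration; only the column list comes from the suggestion above.

```sql
-- Hypothetical index name; the column order (user_id, called_at, mobile)
-- is the one proposed in this answer.
ALTER TABLE users
    ADD INDEX idx_user_called_mobile (user_id, called_at, mobile);
```

On a table of a couple of million rows the ALTER may take a while, so it is probably worth trying on a copy of the table first.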
Is any combination of columns "unique"? If so, make that combination the PRIMARY KEY; that will help further. If not, at least provide id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY.
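The two options could look roughly like the sketch below. It assumes the existing unique key idx_unique_a_ri_si on (record_id, source_id) from the posted table definition is the "unique" combination in question; run one option, not both.

```sql
-- Option A (assumption: record_id + source_id is the natural key):
-- promote the existing unique key to the primary key.
ALTER TABLE users
    DROP INDEX idx_unique_a_ri_si,
    ADD PRIMARY KEY (record_id, source_id);

-- Option B: add a surrogate key instead.
ALTER TABLE users
    ADD COLUMN id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;
```

Because both columns of that unique key are NOT NULL, InnoDB already uses it as the clustered index when no primary key is declared, so Option A mostly makes the arrangement explicit. With Option B, every secondary index carries the 4-byte id rather than a varchar(100) plus an int, which keeps the secondary indexes smaller.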
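Why it helps
An index is a BTree. (See Wikipedia for a good definition.) That index structure is separate from the data, which lives in its own BTree ordered by the PRIMARY KEY. A BTree is very efficient at finding one row or a set of consecutive rows ("consecutive" according to the index). When a secondary key (that is, not the PRIMARY KEY) is used, the rows are first found in the index, and then each data row is looked up through the PRIMARY KEY. Unless... if all the columns needed by the SELECT are in the secondary key, there is no need to reach over into the data. This is called "covering"; EXPLAIN indicates it by saying "Using index". My index is a "covering" index for the subquery.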
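The order of the columns in any index matters. In this case, the index puts all the user_id IS NOT NULL rows together. But that is the only argument for this particular ordering of the 3 columns.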
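One way to check whether an index is being used as a covering index is to run EXPLAIN on the query and look at the Extra column. The sketch below assumes an index on (user_id, called_at, mobile) has been added as suggested in this answer; whether the optimizer actually picks it depends on the server version and table statistics.

```sql
-- If the chosen index covers the query, Extra should include "Using index".
EXPLAIN
SELECT user_id, mobile, COUNT(DISTINCT called_at) AS cnt
FROM users
WHERE called_at >= '2016-09-01' AND called_at < '2016-12-01'
  AND user_id IS NOT NULL
GROUP BY user_id, mobile
HAVING cnt > 1;
```

In the EXPLAIN output posted in the question, neither plan shows "Using index": idx_a_di_um holds only user_id and mobile, so called_at still has to be fetched from the data rows.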
The Handler trick
Here is a way to see what a query is doing that does not depend on caching, server restarts, etc.:
FLUSH STATUS;
SELECT ...;
SHOW SESSION STATUS LIKE 'Handler%';
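Numbers about the size of the table (in rows) indicate a table (or index) scan. Numbers that look like the output size indicate some final operation. Handler_write... indicates a tmp table. Etc.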
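Applied to the second query from the question, the three steps might look like the following sketch. FLUSH STATUS may require an extra privilege on some servers.

```sql
-- Reset the session's status counters.
FLUSH STATUS;

-- The query under test (the second query from the question).
SELECT user_id, mobile, COUNT(DISTINCT called_at) AS cnt
FROM users
WHERE called_at >= '2016-09-01' AND called_at < '2016-12-01'
  AND user_id IS NOT NULL
GROUP BY user_id, mobile
HAVING cnt > 1;

-- Show how many low-level row operations the query performed.
SHOW SESSION STATUS LIKE 'Handler%';
```

Read this way, the counters posted in the question seem to tell a consistent story. The output with Handler_read_next = 2468272 and no writes looks like a single pass over an index, which matches the plan using idx_a_di_um. The output with Handler_write = 1208173 looks like a tmp table being filled for the derived table, and its Handler_read_rnd_next = 3676447 is within a couple of reads of 2468272 + 1208173 = 3676445, that is, one full table scan plus one scan of the tmp table.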