MySQL GROUP BY query performance improved by adding an aggregate

Time: 2016-11-28 22:00:10

Tags: mysql performance group-by

I have the following table in MySQL:

CREATE TABLE `events` (
  `pv_name` varchar(60) COLLATE utf8mb4_bin NOT NULL,
  `time_stamp` bigint(20) unsigned NOT NULL,
  `event_type` varchar(40) COLLATE utf8mb4_bin NOT NULL,
  `has_data` tinyint(1) NOT NULL,
  `data` json DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin ROW_FORMAT=COMPRESSED;

ALTER TABLE `events`
 ADD PRIMARY KEY (`pv_name`,`time_stamp`),
 ADD UNIQUE KEY `has_data` (`pv_name`,`has_data`,`time_stamp`);

I am trying to find the distinct set of pv_names that have no data between two given times. Both of the following queries appear to return this information:

mysql> EXPLAIN SELECT pv_name FROM events
         WHERE has_data = 0
           AND events.time_stamp > 0 AND events.time_stamp < 9999999999999999999
         GROUP BY events.pv_name;
+----+-------------+--------+------------+-------+------------------+----------+---------+------+---------+----------+--------------------------+
| id | select_type | table  | partitions | type  | possible_keys    | key      | key_len | ref  | rows    | filtered | Extra                    |
+----+-------------+--------+------------+-------+------------------+----------+---------+------+---------+----------+--------------------------+
|  1 | SIMPLE      | events | NULL       | index | PRIMARY,has_data | has_data | 251     | NULL | 1855281 |     1.11 | Using where; Using index |
+----+-------------+--------+------------+-------+------------------+----------+---------+------+---------+----------+--------------------------+

mysql> EXPLAIN SELECT pv_name, MAX(events.time_stamp) FROM events
         WHERE has_data = 0
           AND events.time_stamp > 0 AND events.time_stamp < 9999999999999999999
         GROUP BY events.pv_name;
+----+-------------+--------+------------+-------+------------------+----------+---------+------+--------+----------+---------------------------------------+
| id | select_type | table  | partitions | type  | possible_keys    | key      | key_len | ref  | rows   | filtered | Extra                                 |
+----+-------------+--------+------------+-------+------------------+----------+---------+------+--------+----------+---------------------------------------+
|  1 | SIMPLE      | events | NULL       | range | PRIMARY,has_data | has_data | 251     | NULL | 203123 |   100.00 | Using where; Using index for group-by |
+----+-------------+--------+------------+-------+------------------+----------+---------+------+--------+----------+---------------------------------------+

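I don't understand why the second query, which returns an extra column that I don't need, appears to run faster than the first. Is there a way to make the first query as efficient as the second without adding the aggregate on the time_stamp column?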

Edit:

Per Rick James's suggestion, I changed the has_data index:

ALTER TABLE `events`
 ADD PRIMARY KEY (`pv_name`,`time_stamp`), ADD KEY `has_data` (`has_data`,`pv_name`,`time_stamp`);

This changed the query plans to:

mysql> EXPLAIN SELECT pv_name FROM events WHERE has_data = 0 AND events.time_stamp > 0 AND events.time_stamp < 9999999999999999999 GROUP BY events.pv_name;
+----+-------------+--------+------------+------+------------------+----------+---------+-------+--------+----------+--------------------------+
| id | select_type | table  | partitions | type | possible_keys    | key      | key_len | ref   | rows   | filtered | Extra                    |
+----+-------------+--------+------------+------+------------------+----------+---------+-------+--------+----------+--------------------------+
|  1 | SIMPLE      | events | NULL       | ref  | PRIMARY,has_data | has_data | 1       | const | 267096 |    11.11 | Using where; Using index |
+----+-------------+--------+------------+------+------------------+----------+---------+-------+--------+----------+--------------------------+
1 row in set, 1 warning (0.00 sec)

mysql> EXPLAIN SELECT pv_name, MAX(events.time_stamp) FROM events WHERE has_data = 0 AND events.time_stamp > 0 AND events.time_stamp < 9999999999999999999 GROUP BY events.pv_name;
+----+-------------+--------+------------+------+------------------+----------+---------+-------+--------+----------+--------------------------+
| id | select_type | table  | partitions | type | possible_keys    | key      | key_len | ref   | rows   | filtered | Extra                    |
+----+-------------+--------+------------+------+------------------+----------+---------+-------+--------+----------+--------------------------+
|  1 | SIMPLE      | events | NULL       | ref  | PRIMARY,has_data | has_data | 1       | const | 267096 |    11.11 | Using where; Using index |
+----+-------------+--------+------------+------+------------------+----------+---------+-------+--------+----------+--------------------------+
1 row in set, 1 warning (0.01 sec)

This appears to run faster.

Edit:

Test results requested by Rick James:

mysql> FLUSH STATUS;
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT pv_name FROM events WHERE has_data = 0 AND events.time_stamp > 0 AND events.time_stamp < 9999999999999999999 GROUP BY events.pv_name;
.
.
.
114480 rows in set (0.34 sec)

mysql> SHOW SESSION STATUS LIKE 'Handler%';
+----------------------------+--------+
| Variable_name              | Value  |
+----------------------------+--------+
| Handler_commit             | 1      |
| Handler_delete             | 0      |
| Handler_discover           | 0      |
| Handler_external_lock      | 2      |
| Handler_mrr_init           | 0      |
| Handler_prepare            | 0      |
| Handler_read_first         | 0      |
| Handler_read_key           | 1      |
| Handler_read_last          | 0      |
| Handler_read_next          | 125527 |
| Handler_read_prev          | 0      |
| Handler_read_rnd           | 0      |
| Handler_read_rnd_next      | 0      |
| Handler_rollback           | 0      |
| Handler_savepoint          | 0      |
| Handler_savepoint_rollback | 0      |
| Handler_update             | 0      |
| Handler_write              | 0      |
+----------------------------+--------+
18 rows in set (0.01 sec)

mysql> SELECT COUNT(*) FROM events;
+----------+
| COUNT(*) |
+----------+
|  3683887 |
+----------+
1 row in set (11.66 sec)

Edit:

Run times:

mysql> SHOW INDEXES FROM events;
+--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table  | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| events |          0 | PRIMARY  |            1 | pv_name     | A         |      216061 |     NULL | NULL   |      | BTREE      |         |               |
| events |          0 | PRIMARY  |            2 | time_stamp  | A         |     4450791 |     NULL | NULL   |      | BTREE      |         |               |
| events |          1 | has_data |            1 | has_data    | A         |         258 |     NULL | NULL   |      | BTREE      |         |               |
| events |          1 | has_data |            2 | pv_name     | A         |      496542 |     NULL | NULL   |      | BTREE      |         |               |
| events |          1 | has_data |            3 | time_stamp  | A         |     4390035 |     NULL | NULL   |      | BTREE      |         |               |
+--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
5 rows in set (0.00 sec)

mysql> EXPLAIN SELECT events.pv_name FROM events WHERE has_data = 0 AND events.time_stamp > 0 AND events.time_stamp < 9999999999999999999 GROUP BY events.pv_name;
+----+-------------+--------+------------+------+------------------+----------+---------+-------+--------+----------+--------------------------+
| id | select_type | table  | partitions | type | possible_keys    | key      | key_len | ref   | rows   | filtered | Extra                    |
+----+-------------+--------+------------+------+------------------+----------+---------+-------+--------+----------+--------------------------+
|  1 | SIMPLE      | events | NULL       | ref  | PRIMARY,has_data | has_data | 1       | const | 267096 |    11.11 | Using where; Using index |
+----+-------------+--------+------------+------+------------------+----------+---------+-------+--------+----------+--------------------------+
1 row in set, 1 warning (0.00 sec)

mysql> EXPLAIN SELECT events.pv_name, MAX(events.time_stamp) FROM events WHERE has_data = 0 AND events.time_stamp > 0 AND events.time_stamp < 9999999999999999999 GROUP BY events.pv_name;
+----+-------------+--------+------------+------+------------------+----------+---------+-------+--------+----------+--------------------------+
| id | select_type | table  | partitions | type | possible_keys    | key      | key_len | ref   | rows   | filtered | Extra                    |
+----+-------------+--------+------------+------+------------------+----------+---------+-------+--------+----------+--------------------------+
|  1 | SIMPLE      | events | NULL       | ref  | PRIMARY,has_data | has_data | 1       | const | 267096 |    11.11 | Using where; Using index |
+----+-------------+--------+------------+------+------------------+----------+---------+-------+--------+----------+--------------------------+
1 row in set, 1 warning (0.00 sec)


SELECT events.pv_name FROM events WHERE has_data = 0 AND events.time_stamp > 0 AND events.time_stamp < 9999999999999999999 GROUP BY events.pv_name;
114480 rows in set (0.37 sec)

SELECT events.pv_name, MAX(events.time_stamp) FROM events WHERE has_data = 0 AND events.time_stamp > 0 AND events.time_stamp < 9999999999999999999 GROUP BY events.pv_name;
114480 rows in set (0.30 sec)


mysql> SHOW INDEXES FROM events;
+--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table  | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| events |          0 | PRIMARY  |            1 | pv_name     | A         |      422951 |     NULL | NULL   |      | BTREE      |         |               |
| events |          0 | PRIMARY  |            2 | time_stamp  | A         |     4321990 |     NULL | NULL   |      | BTREE      |         |               |
| events |          0 | has_data |            1 | pv_name     | A         |      240067 |     NULL | NULL   |      | BTREE      |         |               |
| events |          0 | has_data |            2 | has_data    | A         |      436525 |     NULL | NULL   |      | BTREE      |         |               |
| events |          0 | has_data |            3 | time_stamp  | A         |     4205163 |     NULL | NULL   |      | BTREE      |         |               |
+--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
5 rows in set (0.00 sec)

mysql> EXPLAIN SELECT events.pv_name FROM events WHERE has_data = 0 AND events.time_stamp > 0 AND events.time_stamp < 9999999999999999999 GROUP BY events.pv_name;
+----+-------------+--------+------------+-------+------------------+----------+---------+------+---------+----------+--------------------------+
| id | select_type | table  | partitions | type  | possible_keys    | key      | key_len | ref  | rows    | filtered | Extra                    |
+----+-------------+--------+------------+-------+------------------+----------+---------+------+---------+----------+--------------------------+
|  1 | SIMPLE      | events | NULL       | index | PRIMARY,has_data | has_data | 251     | NULL | 4462633 |     1.11 | Using where; Using index |
+----+-------------+--------+------------+-------+------------------+----------+---------+------+---------+----------+--------------------------+
1 row in set, 1 warning (0.00 sec)

mysql> EXPLAIN SELECT events.pv_name, MAX(events.time_stamp) FROM events WHERE has_data = 0 AND events.time_stamp > 0 AND events.time_stamp < 9999999999999999999 GROUP BY events.pv_name;
+----+-------------+--------+------------+-------+------------------+----------+---------+------+--------+----------+---------------------------------------+
| id | select_type | table  | partitions | type  | possible_keys    | key      | key_len | ref  | rows   | filtered | Extra                                 |
+----+-------------+--------+------------+-------+------------------+----------+---------+------+--------+----------+---------------------------------------+
|  1 | SIMPLE      | events | NULL       | range | PRIMARY,has_data | has_data | 251     | NULL | 240076 |   100.00 | Using where; Using index for group-by |
+----+-------------+--------+------------+-------+------------------+----------+---------+------+--------+----------+---------------------------------------+
1 row in set, 1 warning (0.00 sec)

SELECT events.pv_name FROM events WHERE has_data = 0 AND events.time_stamp > 0 AND events.time_stamp < 9999999999999999999 GROUP BY events.pv_name;
114480 rows in set (6.79 sec)

SELECT events.pv_name, MAX(events.time_stamp) FROM events WHERE has_data = 0 AND events.time_stamp > 0 AND events.time_stamp < 9999999999999999999 GROUP BY events.pv_name;
114480 rows in set (2.65 sec)

3 Answers:

Answer 0 (score: 1)

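According to the documentation on Loose Index Scan:

Any other parts of the index than those from the GROUP BY referenced in the query must be constants (that is, they must be referenced in equalities with constants), except for the argument of MIN() or MAX() functions.

In the first query, time_stamp is referenced in a range condition, not as a constant. In the second query, time_stamp is also the argument of MAX(), which is the one exception the rule allows. Loose Index Scan therefore applies only to the second query.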
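
One way to act on this, offered as an untested sketch rather than a confirmed fix: keep the MAX() form so the query stays eligible for Loose Index Scan, and wrap it in a derived table that exposes only pv_name. The aliases last_ts and t are hypothetical names. The sketch assumes the original UNIQUE KEY has_data (pv_name, has_data, time_stamp) is still in place and that the optimizer materializes the derived table rather than rewriting it, so check the EXPLAIN output before relying on it.

```sql
-- Sketch only: the inner query is the second query from the question,
-- so it should stay eligible for "Using index for group-by".
-- last_ts and t are illustrative aliases, not names from the question.
EXPLAIN
SELECT t.pv_name
FROM (
    SELECT pv_name, MAX(time_stamp) AS last_ts
    FROM events
    WHERE has_data = 0
      AND time_stamp > 0
      AND time_stamp < 9999999999999999999
    GROUP BY pv_name
) AS t;
```

If the derived-table row of the plan shows Using index for group-by, the outer query does nothing more than discard the extra column.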

Answer 1 (score: 0)

Replace the UNIQUE key with

INDEX(has_data, pv_name, time_stamp) -- in this order

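It is usually better not to make an index UNIQUE unless you need the constraint. In this case, the primary key already enforces uniqueness on the subset (pv_name, time_stamp).

When building an index, start with any columns tested with = (here, has_data). This lets the rest of the processing focus on the relevant rows instead of stumbling over unwanted values of has_data. Put one range column last (time_stamp), since columns after a range usually cannot be used. Including all three columns makes this a "covering" index, so EXPLAIN should say "Using index".

The index I suggest should help both queries.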
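
As a rough sketch of how that change could be applied and checked, assuming the key is still named has_data as in the question's table definition:

```sql
-- Drop the UNIQUE key from the question and recreate it as a plain index,
-- with the equality column first and the range column last.
ALTER TABLE events
    DROP INDEX has_data,
    ADD INDEX has_data (has_data, pv_name, time_stamp);

-- Re-check the plan. With all three columns in the index,
-- Extra is expected to include "Using index" (covering index).
EXPLAIN
SELECT pv_name
FROM events
WHERE has_data = 0
  AND time_stamp > 0
  AND time_stamp < 9999999999999999999
GROUP BY pv_name;
```

The row estimates and timings will depend on the data, so compare the EXPLAIN output and run time before and after the change.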

See also my index cookbook.

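Answer 2 (score: -1)

Under certain specific conditions, GROUP BY can be optimized. That is what happens in the second query. The optimization is called Loose Index Scan (see the MySQL documentation).

Maybe it would also work if you used DISTINCT instead of GROUP BY in the first query? Or you can check the documentation to see how to get the GROUP BY in the first query optimized.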
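
The DISTINCT variant would look like the sketch below. Whether it actually gets the Loose Index Scan is an open question: the documentation applies the same conditions to DISTINCT as to GROUP BY, so the range predicate on time_stamp may still prevent it. Check the Extra column for Using index for group-by.

```sql
-- DISTINCT variant of the first query from the question.
-- Not verified against this table; compare its plan with the GROUP BY form.
EXPLAIN
SELECT DISTINCT pv_name
FROM events
WHERE has_data = 0
  AND time_stamp > 0
  AND time_stamp < 9999999999999999999;
```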

  

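Loose Index Scan

The most efficient way to process GROUP BY is when an index is used to directly retrieve the grouping columns. With this access method, MySQL uses the property of some index types that the keys are ordered (for example, BTREE). This property enables use of lookup groups in an index without having to consider all keys in the index that satisfy all WHERE conditions. This access method considers only a fraction of the keys in an index, so it is called a loose index scan. When there is no WHERE clause, a loose index scan reads as many keys as the number of groups, which may be a much smaller number than that of all keys. If the WHERE clause contains range predicates (see the discussion of the range join type in Section 9.8.1, "Optimizing Queries with EXPLAIN"), a loose index scan looks up the first key of each group that satisfies the range conditions, and again reads the least possible number of keys. This is possible under the following conditions:

  • The query is over a single table.
  • The GROUP BY names only columns that form a leftmost prefix of the index and no other columns. (If, instead of GROUP BY, the query has a DISTINCT clause, all distinct attributes refer to columns that form a leftmost prefix of the index.) For example, if a table t1 has an index on (c1,c2,c3), loose index scan is applicable if the query has GROUP BY c1, c2. It is not applicable if the query has GROUP BY c2, c3 (the columns are not a leftmost prefix) or GROUP BY c1, c2, c4 (c4 is not in the index).
  • The only aggregate functions used in the select list (if any) are MIN() and MAX(), and all of them refer to the same column. The column must be in the index and must immediately follow the columns in the GROUP BY.
  • Any other parts of the index than those from the GROUP BY referenced in the query must be constants (that is, they must be referenced in equalities with constants), except for the argument of MIN() or MAX() functions.

For columns in the index, full column values must be indexed, not just a prefix. For example, with c1 VARCHAR(20), INDEX (c1(10)), the index cannot be used for loose index scan.

If loose index scan is applicable to a query, the EXPLAIN output shows Using index for group-by in the Extra column.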
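
To make the conditions above concrete, here is a small sketch built on the t1 example from the quoted text. The column types, the index name idx, and the constant 5 are assumptions for illustration. On an empty or very small table the optimizer may choose a different plan, so the comments describe eligibility under the listed rules, not guaranteed EXPLAIN output.

```sql
-- Hypothetical table matching the t1 example in the quoted text.
CREATE TABLE t1 (
    c1 INT NOT NULL,
    c2 INT NOT NULL,
    c3 INT NOT NULL,
    c4 INT NOT NULL,
    INDEX idx (c1, c2, c3)
);

-- Eligible under the listed conditions:
-- GROUP BY columns form a leftmost prefix of idx.
EXPLAIN SELECT c1, c2 FROM t1 GROUP BY c1, c2;
-- MIN() on the index column that immediately follows the GROUP BY column.
EXPLAIN SELECT c1, MIN(c2) FROM t1 GROUP BY c1;
-- The remaining index part (c3) is compared to a constant with =.
EXPLAIN SELECT c1, c2 FROM t1 WHERE c3 = 5 GROUP BY c1, c2;

-- Not eligible under the listed conditions:
-- The aggregate is neither MIN() nor MAX().
EXPLAIN SELECT c1, SUM(c2) FROM t1 GROUP BY c1;
-- GROUP BY columns are not a leftmost prefix of idx.
EXPLAIN SELECT c2, c3 FROM t1 GROUP BY c2, c3;
-- c3 is referenced with a range, not an equality with a constant.
EXPLAIN SELECT c1, c2 FROM t1 WHERE c3 > 5 GROUP BY c1, c2;
```

The last query has the same shape as the first query in the question, where time_stamp plays the role of c3.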

Hope this helps.