区别与组别

时间:2014-08-04 08:27:32

标签: mysql group-by distinct

我有两张这样的桌子。 '命令'表有21886行。

CREATE TABLE `order` (
  `id` bigint(20) unsigned NOT NULL,
  `reg_date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  KEY `idx_reg_date` (`reg_date`),
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci


CREATE TABLE `order_detail_products` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `order_id` bigint(20) unsigned NOT NULL,
  `order_detail_id` int(11) NOT NULL,
  `prod_id` int(11) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_order_detail_id` (`order_detail_id`,`prod_id`),
  KEY `idx_order_id` (`order_id`,`order_detail_id`,`prod_id`)
) ENGINE=InnoDB AUTO_INCREMENT=572375 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

我的问题在这里。

MariaDB [test]> explain
    -> SELECT DISTINCT A.id
    -> FROM order A
    -> JOIN order_detail_products B ON A.id = B.order_id
    -> ORDER BY A.reg_date DESC LIMIT 100, 30;
+------+-------------+-------+-------+---------------+--------------+---------+-------------------+-------+----------------------------------------------+
| id   | select_type | table | type  | possible_keys | key          | key_len | ref               | rows  | Extra                                        |
+------+-------------+-------+-------+---------------+--------------+---------+-------------------+-------+----------------------------------------------+
|    1 | SIMPLE      | A     | index | PRIMARY       | idx_reg_date | 8       | NULL              | 22151 | Using index; Using temporary; Using filesort |
|    1 | SIMPLE      | B     | ref   | idx_order_id  | idx_order_id | 8       | bom_20140804.A.id |     2 | Using index; Distinct                        |
+------+-------------+-------+-------+---------------+--------------+---------+-------------------+-------+----------------------------------------------+
2 rows in set (0.00 sec)

MariaDB [test]> explain
    -> SELECT A.id
    -> FROM order A
    -> JOIN order_detail_products B ON A.id = B.order_id
    -> GROUP BY A.id
    -> ORDER BY A.reg_date DESC LIMIT 100, 30;
+------+-------------+-------+-------+---------------+--------------+---------+-------------------+------+------------------------------+
| id   | select_type | table | type  | possible_keys | key          | key_len | ref               | rows | Extra                        |
+------+-------------+-------+-------+---------------+--------------+---------+-------------------+------+------------------------------+
|    1 | SIMPLE      | A     | index | PRIMARY       | idx_reg_date | 8       | NULL              |   65 | Using index; Using temporary |
|    1 | SIMPLE      | B     | ref   | idx_order_id  | idx_order_id | 8       | bom_20140804.A.id |    2 | Using index                  |
+------+-------------+-------+-------+---------------+--------------+---------+-------------------+------+------------------------------+

上面列出了两个查询返回相同的结果,但是distinct太慢(解释太多行)。 有什么区别?

1 个答案:

答案 0 :(得分:0)

通常建议您使用DISTINCT而不是GROUP BY,因为这是您真正想要的,然后让优化器选择“最佳”执行计划。但是,没有优化器是完美的。使用DISTINCT,优化器可以为执行计划提供更多选项。但这也意味着选择错误的计划有更多的选项

您写道DISTINCT查询是“慢”的,但您没有告诉任何数字。在我的测试中(在 MariaDB 10.0.19 10.3.13 上具有10倍的行),DISTINCT查询的速度(仅)慢了25%( 562ms / 453ms)。 EXPLAIN的结果根本没有帮助。甚至在“说谎”。使用LIMIT 100, 30时,它至少需要读取130行(这是我的EXPLAIN实际为GROUP BY编写的内容),但显示为65。

我无法解释执行时间有25%的差异,但是无论如何引擎似乎都在进行全表/索引扫描,并在对结果进行排序之前可以跳过100行并选择30行。

最好的计划可能是:

  • idx_reg_date索引(表A)中按降序逐行读取行
  • 查看idx_order_id索引(表B)中是否存在匹配项
  • 跳过100条匹配的行
  • 发送30个匹配的行
  • 退出

如果A中大约有10%的行在B中不匹配,则该计划将从A中读取143行。

我能以某种方式强制执行此计划的最佳方法是:

SELECT A.id
FROM `order` A
WHERE EXISTS (SELECT * FROM order_detail_products B WHERE A.id = B.order_id)
ORDER BY A.reg_date DESC
LIMIT 30
OFFSET 100

此查询在156毫秒内返回相同的结果(比GROUP BY快3倍)。但这仍然太慢了。而且可能仍在读取表A中的所有行。

我们可以通过“小”子查询技巧来证明存在更好的计划:

SELECT A.id
FROM (
    SELECT id, reg_date
    FROM `order`
    ORDER BY reg_date DESC
    LIMIT 1000
) A
WHERE EXISTS (SELECT * FROM order_detail_products B WHERE A.id = B.order_id)
ORDER BY A.reg_date DESC
LIMIT 30
OFFSET 100

此查询在“无时间”(〜0毫秒)内执行,并在我的测试数据上返回相同的结果。尽管它不是100%可靠的,但它表明优化器的工作做得不好。

那我的结论是什么

  • 优化器并非总是做得最好,有时需要帮助
  • 即使我们知道“最佳计划”,也不能总是执行它
  • DISTINCT并不总是比GROUP BY
  • 当所有子句都不能使用索引时-事情变得很棘手

测试架构和伪数据:

drop table if exists `order`;
CREATE TABLE `order` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `reg_date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  KEY `idx_reg_date` (`reg_date`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

insert into `order`(reg_date)
    select from_unixtime(floor(rand(1) * 1000000000)) as reg_date
    from information_schema.COLUMNS a
       , information_schema.COLUMNS b
    limit 218860;

drop table if exists `order_detail_products`;
CREATE TABLE `order_detail_products` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `order_id` bigint(20) unsigned NOT NULL,
  `order_detail_id` int(11) NOT NULL,
  `prod_id` int(11) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_order_detail_id` (`order_detail_id`,`prod_id`),
  KEY `idx_order_id` (`order_id`,`order_detail_id`,`prod_id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

insert into order_detail_products(id, order_id, order_detail_id, prod_id)
    select null as id
    , floor(rand(2)*218860)+1 as order_id
    , 0 as order_detail_id
    , 0 as prod_id
    from information_schema.COLUMNS a
       , information_schema.COLUMNS b
    limit 437320;

查询:

SELECT DISTINCT A.id
FROM `order` A
JOIN order_detail_products B ON A.id = B.order_id
ORDER BY A.reg_date DESC
LIMIT 30 OFFSET 100;
-- 562 ms

SELECT A.id
FROM `order` A
JOIN order_detail_products B ON A.id = B.order_id
GROUP BY A.id
ORDER BY A.reg_date DESC
LIMIT 30 OFFSET 100;
-- 453 ms

SELECT A.id
FROM `order` A
WHERE EXISTS (SELECT * FROM order_detail_products B WHERE A.id = B.order_id)
ORDER BY A.reg_date DESC
LIMIT 30 OFFSET 100;
-- 156 ms

SELECT A.id
FROM (
    SELECT id, reg_date
    FROM `order`
    ORDER BY reg_date DESC
    LIMIT 1000
) A
WHERE EXISTS (SELECT * FROM order_detail_products B WHERE A.id = B.order_id)
ORDER BY A.reg_date DESC
LIMIT 30 OFFSET 100;
-- ~ 0 ms