Question

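TL;DR: I have a query over 2 huge tables. They have no indexes. It is slow. So I built indexes. It got slower. Why does this make sense? What is the correct way to optimize it?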

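Background:

I have 2 tables:

person, a table containing information about people (id, birthdate)
works_in, a 0-N relation between person and a department; works_in contains id, person_id, department_id.

They are InnoDB tables, and sadly switching to MyISAM is not an option because data integrity is a requirement.

Both tables are huge and have no indexes other than the PRIMARY key on id.

I'm trying to get the age of the youngest person in each department, and here is the query I came up with: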

SELECT MAX(YEAR(person.birthdate)) as max_year, works_in.department as department
    FROM person
    INNER JOIN works_in
        ON works_in.person_id = person.id
    WHERE person.birthdate IS NOT NULL
    GROUP BY works_in.department

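The query works, but I'm not satisfied with the performance, as it takes about 17s to run. This is expected, since the data is huge and needs to be written to disk, and there are no indexes on the tables.

EXPLAIN for this query gives: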

| id | select_type | table   | type   | possible_keys | key     | key_len | ref                      | rows     | Extra                           | 
|----|-------------|---------|--------|---------------|---------|---------|--------------------------|----------|---------------------------------| 
| 1  | SIMPLE      | works_in| ALL    | NULL          | NULL    | NULL    | NULL                     | 22496409 | Using temporary; Using filesort | 
| 1  | SIMPLE      | person  | eq_ref | PRIMARY       | PRIMARY | 4       | dbtest.works_in.person_id| 1        | Using where                     |

I then built a bunch of indexes on the 2 tables:

/* For works_in */
CREATE INDEX person_id ON works_in(person_id);
CREATE INDEX department_id ON works_in(department_id);
CREATE INDEX department_id_person ON works_in(department_id, person_id);
CREATE INDEX person_department_id ON works_in(person_id, department_id);
/* For person */
CREATE INDEX birthdate ON person(birthdate);

EXPLAIN shows an improvement, at least the way I understand it: it now uses an index and scans fewer rows.

| id | select_type | table   | type  | possible_keys                                    | key                  | key_len | ref              | rows   | Extra                                                 | 
|----|-------------|---------|-------|--------------------------------------------------|----------------------|---------|------------------|--------|-------------------------------------------------------| 
| 1  | SIMPLE      | person  | range | PRIMARY,birthdate                                | birthdate            | 4       | NULL             | 267818 | Using where; Using index; Using temporary; Using f... | 
| 1  | SIMPLE      | works_in| ref   | person,department_id_person,person_department_id | person_department_id | 4       | dbtest.person.id | 3      | Using index                                           |

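However, the execution time of the query has doubled (from ~17s to ~35s).

Why does this make sense, and what is the correct way to optimize this?

EDIT

Using Gordon Linoff's answer (the first one), the execution time is ~9s (half of the initial time). Choosing good indexes does seem to help, but the execution time is still pretty high. Any other ideas on how to improve on this?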

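More information about the dataset:

There are about 5,000,000 records in the person table.
Of those, only 130,000 have a valid (not NULL) birthdate.
I do have a department table, which contains about 3,000,000 records (they are actually projects rather than departments).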

Answer 1

For this query:

SELECT MAX(YEAR(p.birthdate)) as max_year, wi.department as department
FROM person p INNER JOIN
     works_in wi
     ON wi.person_id = p.id
WHERE p.birthdate IS NOT NULL
GROUP BY wi.department;

The best indexes are: person(birthdate, id) and works_in(person_id, department). These are covering indexes for the query, so they save the extra cost of reading the data pages.
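As a rough sketch, the two covering indexes could be created and checked as follows. The index names are made up for illustration, and the column name department follows the query above (the question's schema lists it as department_id), so adjust both to the real schema.

```sql
-- Hypothetical index names; columns follow the query above.
-- Serves the WHERE on birthdate and supplies id for the join.
CREATE INDEX idx_person_birthdate_id ON person (birthdate, id);

-- Serves the join on person_id and supplies department for the GROUP BY.
CREATE INDEX idx_works_in_person_department ON works_in (person_id, department);

-- Re-check the plan. "Using index" in the Extra column for a table
-- means that table is read from the index alone, without the data pages.
EXPLAIN
SELECT MAX(YEAR(p.birthdate)) AS max_year, wi.department AS department
FROM person p
INNER JOIN works_in wi ON wi.person_id = p.id
WHERE p.birthdate IS NOT NULL
GROUP BY wi.department;
```

Since every extra index has to be maintained on writes, it is probably worth dropping the experimental indexes that the plan does not end up using.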

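By the way, unless lots of persons have NULL birthdates (i.e. there are departments where everyone has a NULL birthdate), the query is basically equivalent to: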

SELECT MAX(YEAR(p.birthdate)) as max_year, wi.department as department
FROM person p INNER JOIN
     works_in wi
     ON wi.person_id = p.id
GROUP BY wi.department;

For this, the best indexes are person(id, birthdate) and works_in(person_id, department).

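EDIT:

I can't think of an easy way to solve the problem. One solution is more powerful hardware.

If you really need this information quickly, then additional work is needed.

One approach is to add a maximum birth date to the departments table, and add triggers. For works_in, you would need triggers for update, insert, and delete. For persons, only update (presumably insert and delete would be handled by works_in). This saves the final group by, which should be a big saving.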
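A minimal sketch of the insert case is shown here. It assumes a departments table with primary key id, a new column called max_birthdate, and that works_in.department holds the department id; all of these names, and the trigger name, are assumptions for illustration rather than part of the original schema.

```sql
-- Hypothetical column that caches the latest birthdate per department.
ALTER TABLE departments ADD COLUMN max_birthdate DATE NULL;

-- One-off backfill from the existing data.
UPDATE departments d
JOIN (
    SELECT wi.department, MAX(p.birthdate) AS max_birthdate
    FROM works_in wi
    JOIN person p ON p.id = wi.person_id
    GROUP BY wi.department
) agg ON agg.department = d.id
SET d.max_birthdate = agg.max_birthdate;

-- Keep the cached value current when someone joins a department.
-- COALESCE is needed because GREATEST returns NULL if any argument is NULL.
CREATE TRIGGER works_in_after_insert
AFTER INSERT ON works_in
FOR EACH ROW
    UPDATE departments d
    JOIN person p ON p.id = NEW.person_id
    SET d.max_birthdate = GREATEST(COALESCE(d.max_birthdate, p.birthdate), p.birthdate)
    WHERE d.id = NEW.department
      AND p.birthdate IS NOT NULL;

-- The report then becomes a plain scan of departments.
SELECT YEAR(max_birthdate) AS max_year, id AS department
FROM departments
WHERE max_birthdate IS NOT NULL;
```

The delete and update triggers are harder, because removing the youngest person means the maximum for that department has to be recomputed rather than adjusted incrementally.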

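A simpler approach is to add the maximum birth date just to works_in. However, you would still need the final aggregation, and that might be expensive.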
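One way this could look, as a sketch only: copy the birthdate into works_in so that the aggregation touches a single table. The column name person_birthdate and the index name are hypothetical, and the copied value would still have to be kept in sync (for example by a trigger on person) whenever a birthdate changes.

```sql
-- Hypothetical denormalized column on works_in.
ALTER TABLE works_in ADD COLUMN person_birthdate DATE NULL;

-- One-off copy; on a table of this size it may be wise to run it in batches.
UPDATE works_in wi
JOIN person p ON p.id = wi.person_id
SET wi.person_birthdate = p.birthdate;

-- Index with the GROUP BY column first and the aggregated column second.
CREATE INDEX idx_works_in_department_birthdate
    ON works_in (department, person_birthdate);

-- YEAR(MAX(x)) returns the same value as MAX(YEAR(x)) because YEAR never
-- decreases as the date increases, and it leaves MAX() applied to a bare
-- indexed column.
SELECT YEAR(MAX(person_birthdate)) AS max_year, department
FROM works_in
GROUP BY department;
```

With MAX() applied directly to the second index column, the optimizer may be able to read only the last entry per department (reported as "Using index for group-by" in EXPLAIN), but whether it does so depends on the MySQL version and the data, so check the plan.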

Answer 2

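Indexing improves performance for MyISAM tables. It degrades performance on InnoDB tables.

Add indexes on the columns you expect to query the most. The more complex the data relationships become, especially when a table is related to itself (such as in a self inner join), the worse each query's performance gets.

With an index, the engine must first use the index to get the matching values, which is fast. Then it must use those matches to look up the actual rows in the table. If the index doesn't narrow down the number of rows, it can be faster to just read all the rows in the table.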
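To see whether a particular index is helping or hurting, one option is to compare the plan (and the run time) with and without it using an index hint. This sketch uses the birthdate index name and the query from the question; nothing else is assumed.

```sql
-- Plan with the optimizer free to choose the birthdate index.
EXPLAIN
SELECT MAX(YEAR(person.birthdate)) AS max_year, works_in.department AS department
FROM person
INNER JOIN works_in ON works_in.person_id = person.id
WHERE person.birthdate IS NOT NULL
GROUP BY works_in.department;

-- Same query with the birthdate index hidden from the optimizer,
-- so person can no longer be read through that index.
EXPLAIN
SELECT MAX(YEAR(person.birthdate)) AS max_year, works_in.department AS department
FROM person IGNORE INDEX (birthdate)
INNER JOIN works_in ON works_in.person_id = person.id
WHERE person.birthdate IS NOT NULL
GROUP BY works_in.department;
```

Running both variants without EXPLAIN and timing them shows which access path is actually cheaper on the real data.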

When to add an index on a SQL table field (MySQL)?

When to use MyISAM and InnoDB?

https://dba.stackexchange.com/questions/1/what-are-the-main-differences-between-innodb-and-myisam

MySQL performance of index on huge tables
