Question

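I know this question has been asked several times, but reading those answers has not helped me make my query fast.

Basically, I have a table with about 400k rows. It used to hold over 1.8 million rows, and queries took more than 17 seconds, so I set up a cron job that deletes records older than 5 days to keep the table at around 400k rows. The query now takes just over 5 seconds, which is still slow. We also have other tables with over 2 million records that are queried with JOINs, so I would like to fix this trends table first to gain experience, then move on to the other tables to improve query performance in more complex cases.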

Data structure:

| _id | doctype | subtype | term | user_id | nug_id  | source | timestamp | confidence |
|-----|---------|---------|------|---------|---------|--------|-----------|------------|
| 123 |  post   | keyword | games| 1000    | 200     | twitter| 143389203 |  0.0123    |

I have indexed term, timestamp, source, and confidence.

My typical query is:

SELECT term, SUM(confidence) AS relevance FROM trends 
WHERE source IN ("twitter", "tumblr", "instagram", "post", "flickr")
GROUP BY term ORDER BY relevance DESC

Here is the result:

Showing rows 0 - 29 (165032 total, Query took 5.8050 sec)

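So what should I do next to optimize the indexes or the query to improve performance? I can already foresee how bad my query times will be once JOINs are involved.

ADD1: Sorry, I forgot to attach the EXPLAIN output.

ADD2: Table structure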

CREATE TABLE `trends` (
 `_id` bigint(20) NOT NULL AUTO_INCREMENT,
 `doctype` varchar(10) DEFAULT NULL,
 `subtype` varchar(20) DEFAULT NULL,
 `term` varchar(200) DEFAULT NULL,
 `user_id` varchar(100) DEFAULT NULL,
 `nug_id` varchar(100) DEFAULT NULL,
 `timestamp` bigint(20) DEFAULT NULL,
 `source` varchar(100) DEFAULT NULL,
 `confidence` float DEFAULT NULL,
 PRIMARY KEY (`_id`),
 KEY `confidence` (`confidence`),
 KEY `give_me_trends` (`user_id`,`source`),
 KEY `term` (`term`,`source`),
 KEY `timestamp` (`timestamp`,`confidence`),
 KEY `source` (`source`)
) ENGINE=InnoDB AUTO_INCREMENT=95350350 DEFAULT CHARSET=utf8

ADD3：

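After creating a new table named test_trends and copying the data over from the trends table, I tested with the source column as an integer. I also dropped the doctype and subtype columns, since they are not needed at all. The query is: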

SELECT term, SUM(confidence) AS relevance FROM test_trends 
WHERE source IN (1,2,3,4,5,6,7) 
GROUP BY term ORDER BY relevance DESC

It took 5.4802 seconds.

The EXPLAIN output is:

| id  | select_type |    table    |   type |   possible_keys   |   key   |  key_len  |   ref  |  rows  |                     Extra                    |
|-----|-------------|-------------|--------|-------------------|---------|-----------|--------|--------|----------------------------------------------|
|  1  |   SIMPLE    | test_trends |  index |  source,source_2  |  term_2 |    603    |  NULL  | 354324 | Using where; Using temporary; Using filesort |

ADD4：

My test table structure:

CREATE TABLE `test_trends` (
 `_id` bigint(20) NOT NULL AUTO_INCREMENT,
 `term` varchar(200) DEFAULT NULL,
 `user_id` varchar(100) DEFAULT NULL,
 `nug_id` varchar(100) DEFAULT NULL,
 `timestamp` bigint(20) DEFAULT NULL,
 `source` tinyint(1) DEFAULT NULL,
 `confidence` float DEFAULT NULL,
 PRIMARY KEY (`_id`),
 KEY `confidence` (`confidence`),
 KEY `give_me_trends` (`user_id`,`source`),
 KEY `term` (`term`,`source`),
 KEY `timestamp` (`timestamp`,`confidence`),
 KEY `source` (`source`),
 KEY `term_2` (`term`),
 KEY `source_2` (`source`,`confidence`,`timestamp`)
) ENGINE=InnoDB AUTO_INCREMENT=95354268 DEFAULT CHARSET=utf8

I have also indexed term, source, confidence, and timestamp.

ADD5：

+-------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table       | Non_unique | Key_name       | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+-------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| test_trends | 0          | PRIMARY        | 1            | _id         | A         | 379365      | NULL     | NULL   |      | BTREE      |         |               |
+-------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| test_trends | 1          | confidence     | 1            | confidence  | A         | 18          | NULL     | NULL   | YES  | BTREE      |         |               |
+-------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| test_trends | 1          | give_me_trends | 1            | user_id     | A         | 149         | NULL     | NULL   | YES  | BTREE      |         |               |
+-------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| test_trends | 1          | give_me_trends | 2            | source      | A         | 556         | NULL     | NULL   | YES  | BTREE      |         |               |
+-------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| test_trends | 1          | term           | 1            | term        | A         | 379365      | NULL     | NULL   | YES  | BTREE      |         |               |
+-------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| test_trends | 1          | term           | 2            | source      | A         | 379365      | NULL     | NULL   | YES  | BTREE      |         |               |
+-------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| test_trends | 1          | timestamp      | 1            | timestamp   | A         | 13548       | NULL     | NULL   | YES  | BTREE      |         |               |
+-------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| test_trends | 1          | timestamp      | 2            | confidence  | A         | 189682      | NULL     | NULL   | YES  | BTREE      |         |               |
+-------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| test_trends | 1          | source         | 1            | source      | A         | 107         | NULL     | NULL   | YES  | BTREE      |         |               |
+-------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| test_trends | 1          | term_2         | 1            | term        | A         | 379365      | NULL     | NULL   | YES  | BTREE      |         |               |
+-------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| test_trends | 1          | source_2       | 1            | source      | A         | 18          | NULL     | NULL   | YES  | BTREE      |         |               |
+-------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| test_trends | 1          | source_2       | 2            | confidence  | A         | 189         | NULL     | NULL   | YES  | BTREE      |         |               |
+-------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| test_trends | 1          | source_2       | 3            | timestamp   | A         | 189682      | NULL     | NULL   | YES  | BTREE      |         |               |
+-------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+

Answer 1

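This query will be hard to optimize. There are two things in the query that an index could help with:

- The range predicate (IN()) can be helped by an index on the source column, but the optimizer will not choose this index if the matching rows exceed roughly 20% of the rows in the table.
- An index on the GROUP BY column, term, can help by letting the query read the table in order of the values in that column.

However, an index can help with one or the other of these aspects of the query, not both at once.

You are doing a full index scan of the term_2 index, which is almost as expensive as a table scan. You can see from the EXPLAIN that it visits about 354,000 entries of that index.

You are also getting Using temporary; Using filesort, meaning MySQL builds a temporary table for the grouping and then sorts the result.

I would define all columns as NOT NULL unless they really need to be nullable. As I recall, this helps avoid the Using where note.

You should define a covering index so that the query does not need to read any data outside the index structure itself. Create an index on the columns (term, source, confidence). Make sure the term column comes first in that index; the order of the other two columns does not matter much.

Make sure to increase innodb_buffer_pool_size enough to keep the index in memory.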
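A minimal sketch of that covering index on the test table, assuming MySQL with InnoDB as shown in the question. The index name idx_term_source_confidence is hypothetical, and whether the optimizer actually picks the index depends on the data and the server version.

```sql
-- Hypothetical index name; the column order follows the suggestion above.
ALTER TABLE test_trends
  ADD INDEX idx_term_source_confidence (term, source, confidence);

-- Re-check the plan. If the optimizer picks the new index, the Extra column
-- should include "Using index", meaning no lookups into the table rows.
EXPLAIN
SELECT term, SUM(confidence) AS relevance
FROM test_trends
WHERE source IN (1, 2, 3, 4, 5, 6, 7)
GROUP BY term
ORDER BY relevance DESC;
```

Comparing the key and Extra columns before and after the change is the quickest way to see whether the index is being used.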
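One way to check whether the buffer pool is large enough is to compare it with the table's data and index size. This is a sketch only: the 1 GB value is an arbitrary example, not a recommendation for this server.

```sql
-- Current buffer pool size, in bytes.
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

-- Approximate data and index size of the table, in MB.
SELECT table_name,
       ROUND(data_length  / 1024 / 1024) AS data_mb,
       ROUND(index_length / 1024 / 1024) AS index_mb
FROM information_schema.TABLES
WHERE table_schema = DATABASE()
  AND table_name = 'test_trends';

-- Example value only (1 GB). Resizing at runtime works in MySQL 5.7.5 and later;
-- older versions need the setting in the server config file and a restart.
SET GLOBAL innodb_buffer_pool_size = 1073741824;
```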

Answer 2

Try removing the ORDER BY and doing the sorting in your application logic.

Hopefully that reduces the load of your query.

Answer 3

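Besides what the other answers already mention, one thing I noticed is that you are searching for ints on a varchar column, which makes the column non-sargable (the index cannot be used).

From that search I am guessing you store only numbers in source, so if that is the case, make it an INT column.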
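To illustrate the point, here is a sketch against the original trends table, where source is a varchar. The literal values are placeholders, and the exact plans will vary with the data.

```sql
-- source is VARCHAR here: numeric literals force a string-to-number
-- conversion on every row, so the index on source cannot be used for lookups.
EXPLAIN SELECT COUNT(*) FROM trends WHERE source IN (1, 2, 3);

-- Quoted literals match the column type, so the index on source is usable.
EXPLAIN SELECT COUNT(*) FROM trends WHERE source IN ('1', '2', '3');
```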

Answer 4

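- (Most important suggestion.) Both queries would benefit from a "covering" INDEX(source, term, confidence): start with source for the filtering (WHERE), then continue with the rest of the columns used in the query. "Covering" means the query is performed entirely within the index, without touching the data. The columns in that order may eliminate the temp table and sort for the GROUP BY (but not for the ORDER BY).
- Normalize fields (as you did with source) to shrink the data, which will probably speed things up. (Note that the "key_len" in the EXPLAIN is 603.) This is especially useful if the index is too big to be cached in the buffer_pool.
- If practical, shorten the (200) in term's varchar(200).
- Get rid of redundant indexes: if you have INDEX(a,b), you do not need INDEX(a).
- Is some combination of columns "unique"? If so, we can discuss how to turn it into the PRIMARY KEY.
- Is the data "write only"? That is, do you add new rows but never change old ones? If so, we can talk about Summary Tables, which might give you a 10x speedup. (If applicable, this matters more than the covering index.)
- Since the ORDER BY differs from the GROUP BY, there must be a Using temporary and a Using filesort. Moving the ORDER BY into application code probably has no significant advantage.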
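The index and summary-table suggestions above could be sketched roughly as follows. Several things here are assumptions rather than anything stated in the thread: the names idx_source_term_confidence and trends_summary, the per-day granularity, the once-a-day load, and the reading of `timestamp` as Unix seconds.

```sql
-- Covering index in the suggested column order (hypothetical name).
-- `source` and `term_2` are leading prefixes of `source_2` and `term`,
-- so they are redundant and can be dropped.
ALTER TABLE test_trends
  DROP INDEX `source`,
  DROP INDEX term_2,
  ADD INDEX idx_source_term_confidence (source, term, confidence);

-- Hypothetical summary table: one row per day, source, and term.
CREATE TABLE trends_summary (
  `day` DATE NOT NULL,
  source TINYINT NOT NULL,
  term VARCHAR(200) NOT NULL,
  confidence_sum DOUBLE NOT NULL,
  PRIMARY KEY (`day`, source, term)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

-- Load yesterday's rows (assumes `timestamp` holds Unix seconds).
-- Re-running the statement overwrites the same rows instead of double counting.
INSERT INTO trends_summary (`day`, source, term, confidence_sum)
SELECT DATE(FROM_UNIXTIME(`timestamp`)), source, term, SUM(confidence)
FROM test_trends
WHERE `timestamp` >= UNIX_TIMESTAMP(CURDATE() - INTERVAL 1 DAY)
  AND `timestamp` <  UNIX_TIMESTAMP(CURDATE())
  AND source IS NOT NULL
  AND term IS NOT NULL
GROUP BY DATE(FROM_UNIXTIME(`timestamp`)), source, term
ON DUPLICATE KEY UPDATE confidence_sum = VALUES(confidence_sum);

-- The report then reads the much smaller summary table.
SELECT term, SUM(confidence_sum) AS relevance
FROM trends_summary
WHERE source IN (1, 2, 3, 4, 5, 6, 7)
GROUP BY term
ORDER BY relevance DESC;
```

The trade-off is that results lag behind the raw table by up to a day unless the load runs more often. VALUES() in ON DUPLICATE KEY UPDATE is deprecated in recent MySQL 8.0 releases but still accepted.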

My cookbook on indexing.
