Answer 0 (score: 1)
Following the advice in the article How to pick indexes for order by and group by queries, the tables now look like this:
CREATE TABLE ClusterMatches
(
cluster_index INT UNSIGNED,
match_index INT UNSIGNED,
id INT NOT NULL AUTO_INCREMENT,
tfidf FLOAT,
PRIMARY KEY (match_index,cluster_index,id,tfidf)
);
CREATE TABLE MatchLookup
(
match_index INT UNSIGNED NOT NULL PRIMARY KEY,
image_match TINYTEXT
);
The query that does not sort the results by SUM(tfidf) looks like this:
SELECT match_index, SUM(tfidf) FROM ClusterMatches
WHERE cluster_index in (1,2,3 ... 3000) GROUP BY match_index LIMIT 10;
This eliminates both Using temporary and Using filesort:
explain extended SELECT match_index, SUM(tfidf) FROM ClusterMatches
WHERE cluster_index in (1,2,3 ... 3000) GROUP BY match_index LIMIT 10;
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+--------------------------+
| id | select_type | table                | type  | possible_keys | key     | key_len | ref  | rows  | Extra                    |
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+--------------------------+
|  1 | SIMPLE      | ClusterMatches       | range | PRIMARY       | PRIMARY | 4       | NULL | 14938 | Using where; Using index |
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+--------------------------+
But if I add ORDER BY SUM(tfidf):
SELECT match_index, SUM(tfidf) AS total FROM ClusterMatches
WHERE cluster_index in (1,2,3 ... 3000) GROUP BY match_index
ORDER BY total DESC LIMIT 0,10;
+-------------+--------------------+
| match_index | total              |
+-------------+--------------------+
|         868 |   0.11126546561718 |
|        4182 | 0.0238558370620012 |
|        2162 | 0.0216601379215717 |
|        1406 | 0.0191618576645851 |
|        4239 | 0.0168981291353703 |
|        1437 | 0.0160425212234259 |
|        2599 | 0.0156466849148273 |
|         394 | 0.0155945559963584 |
|        3116 | 0.0151005545631051 |
|        4028 | 0.0149106932803988 |
+-------------+--------------------+
10 rows in set (0.03 sec)
The result comes back reasonably fast at this scale, but the ORDER BY SUM(tfidf) means the query now uses a temporary table and a filesort:
explain extended SELECT match_index, SUM(tfidf) AS total FROM ClusterMatches
WHERE cluster_index IN (1,2,3 ... 3000) GROUP BY match_index
ORDER BY total DESC LIMIT 0,10;
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+-----------------------------------------------------------+
| id | select_type | table                | type  | possible_keys | key     | key_len | ref  | rows  | Extra                                                     |
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+-----------------------------------------------------------+
|  1 | SIMPLE      | ClusterMatches       | range | PRIMARY       | PRIMARY | 4       | NULL | 65369 | Using where; Using index; Using temporary; Using filesort |
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+-----------------------------------------------------------+
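The extra steps in this plan follow from how the index is ordered: the primary key starts with match_index, which is enough to form the groups, but no index holds SUM(tfidf), so every group total has to be computed and sorted before LIMIT can keep the top ten. As a sketch only, the same query can be written with the aggregation in a derived table to make those two stages visible. The alias per_match and the shortened IN list are placeholders I am assuming for illustration, and this form is not expected to remove the temporary table or the sort by itself.

```sql
-- Sketch: the inner query aggregates per match_index,
-- the outer query sorts the aggregated rows and keeps the ten largest totals.
-- per_match is a hypothetical alias; IN (1, 2, 3) stands in for the full cluster list.
SELECT match_index, total
FROM (
    SELECT match_index, SUM(tfidf) AS total
    FROM ClusterMatches
    WHERE cluster_index IN (1, 2, 3)
    GROUP BY match_index
) AS per_match
ORDER BY total DESC
LIMIT 10;
```

Running EXPLAIN on both forms should show whether the optimizer treats them any differently on a given MySQL version.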
I am looking for a solution that uses neither a temporary table nor a filesort, along the lines of:
SELECT match_index, SUM(tfidf) AS total FROM ClusterMatches
WHERE cluster_index IN (1,2,3 ... 3000) GROUP BY cluster_index, match_index
HAVING total>0.01 ORDER BY cluster_index;
except that I would not have to hard-code the threshold on total. Any ideas?