I'm having trouble speeding up a query that takes about 11 seconds on only 2 million rows. Here is a link to my sqlfiddle. Below is the statement I'm trying to run along with my EXPLAIN.
Query:
SELECT crawl.pk Pk,domains.domain Domain,
CONCAT(schemes.scheme, "://", domains.domain, remainders.remainder) Uri,
crawl.redirect Redirect FROM crawl
LEFT JOIN dates ON crawl.date_crawled=dates.pk
LEFT JOIN schemes ON crawl.scheme=schemes.pk
LEFT JOIN domains ON crawl.domain=domains.pk
LEFT JOIN remainders ON crawl.remainder=remainders.pk
WHERE (dates.date < CURDATE() - INTERVAL 30 DAY)
AND crawl.redirect=0
GROUP BY crawl.domain
ORDER BY crawl.date_crawled ASC
LIMIT 50
Explain:
+----+-------------+------------+--------+-----------------------+-----------------------+---------+----------------------------+--------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+--------+-----------------------+-----------------------+---------+----------------------------+--------+----------------------------------------------+
| 1 | SIMPLE | dates | ALL | PRIMARY,date | NULL | NULL | NULL | 7 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | crawl | ref | date_crawled_redirect | date_crawled_redirect | 8 | mytable.dates.pk,const | 408644 | |
| 1 | SIMPLE | schemes | eq_ref | PRIMARY | PRIMARY | 4 | mytable.crawl.scheme | 1 | |
| 1 | SIMPLE | domains | eq_ref | PRIMARY | PRIMARY | 4 | mytable.crawl.domain | 1 | |
| 1 | SIMPLE | remainders | eq_ref | PRIMARY | PRIMARY | 4 | mytable.crawl.remainder | 1 | |
+----+-------------+------------+--------+-----------------------+-----------------------+---------+----------------------------+--------+----------------------------------------------+
5 rows in set (2.26 sec)
Edit #1: Per the comments, I have replaced the LEFT JOINs with JOINs and moved the date filter into the join. Unfortunately, this did not reduce the query time.
SELECT crawl.pk Pk,domains.domain Domain, CONCAT(schemes.scheme, "://", domains.domain, remainders.remainder) Uri, crawl.redirect Redirect
FROM crawl
JOIN schemes ON crawl.scheme=schemes.pk
JOIN domains ON crawl.domain=domains.pk
JOIN remainders ON crawl.remainder=remainders.pk
JOIN dates ON crawl.date_crawled=dates.pk AND dates.date < CURDATE() - INTERVAL 30 DAY
WHERE crawl.redirect=0
GROUP BY crawl.domain
ORDER BY crawl.date_crawled ASC
LIMIT 50
Edit #2:
I've updated the EXPLAIN:
+----+-------------+------------+--------+---------------------------------------------------------+-----------------------+---------+----------------------------+--------+-----------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+--------+---------------------------------------------------------+-----------------------+---------+----------------------------+--------+-----------------------------------------------------------+
| 1 | SIMPLE | dates | range | PRIMARY,date,date_pk,dateBtreeIdx,pk | date_pk | 3 | NULL | 4 | Using where; Using index; Using temporary; Using filesort |
| 1 | SIMPLE | crawl | ref | domain_remainder,remainder,scheme,date_crawled_redirect | date_crawled_redirect | 8 | mytable.dates.pk,const | 408644 | |
| 1 | SIMPLE | schemes | ALL | PRIMARY | NULL | NULL | NULL | 2 | Using where; Using join buffer |
| 1 | SIMPLE | domains | eq_ref | PRIMARY | PRIMARY | 4 | mytable.crawl.domain | 1 | |
| 1 | SIMPLE | remainders | eq_ref | PRIMARY | PRIMARY | 4 | mytable.crawl.remainder | 1 | |
+----+-------------+------------+--------+---------------------------------------------------------+-----------------------+---------+----------------------------+--------+-----------------------------------------------------------+
Edit #3
+----+--------------------+------------+-----------------+------------------------------------------+---------+---------+----------------------------+---------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+------------+-----------------+------------------------------------------+---------+---------+----------------------------+---------+---------------------------------+
| 1 | PRIMARY | schemes | ALL | PRIMARY | NULL | NULL | NULL | 2 | Using temporary; Using filesort |
| 1 | PRIMARY | crawl | ref | domain_remainder,remainder,scheme,domain | scheme | 4 | mytable.schemes.pk | 1448223 | Using where |
| 1 | PRIMARY | domains | eq_ref | PRIMARY | PRIMARY | 4 | mytable.crawl.domain | 1 | |
| 1 | PRIMARY | remainders | eq_ref | PRIMARY | PRIMARY | 4 | mytable.crawl.remainder | 1 | |
| 2 | DEPENDENT SUBQUERY | dates | unique_subquery | PRIMARY,date,date_pk,dateBtreeIdx,pk | PRIMARY | 4 | func | 1 | Using where |
+----+--------------------+------------+-----------------+------------------------------------------+---------+---------+----------------------------+---------+---------------------------------+
5 rows in set (0.04 sec)
Edit #4:
+----+-------------+------------+--------+--------------------------------------+-------------------------+---------+----------------------------+---------+-----------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+--------+--------------------------------------+-------------------------+---------+----------------------------+---------+-----------------------------------------------------------+
| 1 | SIMPLE | dates | range | PRIMARY,date,date_pk,dateBtreeIdx,pk | date_pk | 3 | NULL | 4 | Using where; Using index; Using temporary; Using filesort |
| 1 | SIMPLE | schemes | ALL | PRIMARY | NULL | NULL | NULL | 2 | Using join buffer |
| 1 | SIMPLE | crawl | ref | scheme_domain_remainder | scheme_domain_remainder | 4 | mytable.schemes.pk | 1455517 | Using where |
| 1 | SIMPLE | domains | eq_ref | PRIMARY | PRIMARY | 4 | mytable.crawl.domain | 1 | |
| 1 | SIMPLE | remainders | eq_ref | PRIMARY | PRIMARY | 4 | mytable.crawl.remainder | 1 | |
+----+-------------+------------+--------+--------------------------------------+-------------------------+---------+----------------------------+---------+-----------------------------------------------------------+
5 rows in set (0.04 sec)
Edit #5
SELECT urls.pk PK, domains.domain Domain, CONCAT(schemes.scheme, "://", domains.domain, remainders.remainder) Uri, urls.redirect Redirect, urls.date_crawled DC FROM
(SELECT * FROM (
SELECT * FROM crawl as urls ORDER BY date_crawled ASC
) AS tmp GROUP BY tmp.domain ) as urls
JOIN schemes ON urls.scheme=schemes.pk
JOIN domains ON urls.domain=domains.pk
JOIN remainders ON urls.remainder=remainders.pk
JOIN dates ON urls.date_crawled=dates.pk AND dates.date < CURDATE() - INTERVAL 30 DAY
WHERE urls.redirect=0
ORDER BY urls.date_crawled ASC
LIMIT 50
Answer 0 (score: 2):
You have an almost ideal query at hand. The only problem is a sub-optimal index on the dates table. As you can see in the EXPLAIN output, MySQL cannot use any index on dates, so it uses it as the first table. This leads to a semi-optimized execution plan for your crawl table, which then has to access a huge number of rows.
To improve this, you should add a BTREE index on the dates.date column:
ALTER TABLE dates ADD INDEX dateBtreeIdx USING BTREE (date)
BTREE indexes are used for range conditions, in your case "less than", see here.
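If you want to verify that the optimizer actually picks up the new index, a quick sanity check could look like this (table and index names taken from above; the trimmed-down query is only for illustration):

-- the new dateBtreeIdx index should now be listed for the dates table
SHOW INDEX FROM dates;

-- in the EXPLAIN of your query, the row for dates should change from
-- type=ALL to type=range once an index on `date` can be used
EXPLAIN SELECT crawl.pk
FROM crawl
JOIN dates ON crawl.date_crawled = dates.pk
WHERE dates.date < CURDATE() - INTERVAL 30 DAY AND crawl.redirect = 0;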
Based on that, you could try adding the join column Dates.pk to the index as well. This may speed up the query further, but it depends on your data.
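A sketch of what such a combined index could look like (the name date_pk is only an example, pick whatever fits your naming scheme):

ALTER TABLE dates ADD INDEX date_pk USING BTREE (date, pk);

With both columns in the index, the range scan on date can deliver the matching pk values straight from the index, which shows up as "Using index" in the EXPLAIN for dates.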
Edit:
Now MySQL can use the index on dates.date (type=range and rows=4). You don't see a speed-up because the optimizer now does not use the PRIMARY KEY of schemes ...
However, the performance problem still lies with crawl. Try a different approach using an IN query:
SELECT
crawl.pk Pk, domains.domain Domain,
CONCAT(schemes.scheme, "://", domains.domain, remainders.remainder) Uri,
crawl.redirect Redirect
FROM
crawl, schemes, domains, remainders
WHERE
crawl.scheme=schemes.pk
AND crawl.domain=domains.pk
AND crawl.remainder=remainders.pk
AND crawl.date_crawled IN (SELECT pk FROM dates WHERE (dates.date < CURDATE() - INTERVAL 30 DAY))
AND crawl.redirect=0
GROUP BY
crawl.domain
ORDER BY
crawl.date_crawled ASC
LIMIT 50
Edit #2
SELECT
urls.pk PK, domains.domain Domain,
CONCAT(schemes.scheme, "://", domains.domain, remainders.remainder) Uri,
urls.redirect Redirect,
urls.date_crawled DC
FROM
  (SELECT pk, scheme, `domain`, remainder, redirect, date_crawled FROM crawl GROUP BY `domain` ) as urls
JOIN schemes ON urls.scheme=schemes.pk
JOIN domains ON urls.`domain`=domains.pk
JOIN remainders ON urls.remainder=remainders.pk
JOIN dates ON urls.date_crawled=dates.pk AND dates.date < CURDATE() - INTERVAL 30 DAY
WHERE
urls.redirect=0
ORDER BY urls.date_crawled ASC
LIMIT 50