Categories with a lot of pages (huge offsets) (how does Stack Overflow work?)

Asked: 2011-08-20 11:33:13

Tags: mysql database

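I think my problem can be solved by understanding how Stack Overflow works.

For example, this page loads in a few milliseconds (<300 ms): https://stackoverflow.com/questions?page=61440&sort=newest

The only query I can think of for that page is something like SELECT * FROM stuff ORDER BY date DESC LIMIT {pageNumber}*{stuffPerPage}, {pageNumber}*{stuffPerPage}+{stuffPerPage}

A query like that can take several seconds to run, yet the Stack Overflow page loads almost instantly. It can't be a cached query, because questions are posted all the time, and rebuilding the cache every time a question is posted would be insane.

So, how do you think this works?

(To make the question easier, let's forget about the ORDER BY.) Example (the table is fully cached in RAM and stored on an SSD):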

mysql> select * from thread limit 1000000, 1;
1 row in set (1.61 sec)

mysql> select * from thread limit 10000000, 1;
1 row in set (16.75 sec)

mysql> describe select * from thread limit 1000000, 1;
+----+-------------+--------+------+---------------+------+---------+------+----------+-------+
| id | select_type | table  | type | possible_keys | key  | key_len | ref  | rows     | Extra |
+----+-------------+--------+------+---------------+------+---------+------+----------+-------+
|  1 | SIMPLE      | thread | ALL  | NULL          | NULL | NULL    | NULL | 64801163 |       |
+----+-------------+--------+------+---------------+------+---------+------+----------+-------+

mysql> select * from thread ORDER BY thread_date DESC limit 1000000, 1;
1 row in set (1 min 37.56 sec)


mysql> SHOW INDEXES FROM thread;
+--------+------------+----------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table  | Non_unique | Key_name | Seq_in_index | Column_name  | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+--------+------------+----------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| thread |          0 | PRIMARY  |            1 | newsgroup_id | A         |      102924 |     NULL | NULL   |      | BTREE      |         |               |
| thread |          0 | PRIMARY  |            2 | thread_id    | A         |    47036298 |     NULL | NULL   |      | BTREE      |         |               |
| thread |          0 | PRIMARY  |            3 | postcount    | A         |    47036298 |     NULL | NULL   |      | BTREE      |         |               |
| thread |          0 | PRIMARY  |            4 | thread_date  | A         |    47036298 |     NULL | NULL   |      | BTREE      |         |               |
| thread |          1 | date     |            1 | thread_date  | A         |    47036298 |     NULL | NULL   |      | BTREE      |         |               |
+--------+------------+----------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
5 rows in set (0.00 sec)

2 Answers:

Answer 0: (score: 2)

Create a BTREE index on the date column and the query will run in a breeze:

CREATE INDEX date ON stuff(date) USING BTREE

Update: here is a test I just ran:

CREATE TABLE test( d DATE, i INT, INDEX(d) );

The table was filled with 2,000,000 rows, each with a distinct, unique id.

mysql> SELECT * FROM test LIMIT 1000000, 1;
+------------+---------+
| d          | i       |
+------------+---------+
| 1897-07-22 | 1000000 |
+------------+---------+
1 row in set (0.66 sec)

mysql> SELECT * FROM test ORDER BY d LIMIT 1000000, 1;
+------------+--------+
| d          | i      |
+------------+--------+
| 1897-07-22 | 999980 |
+------------+--------+
1 row in set (1.68 sec)

Here is an interesting observation:

mysql> EXPLAIN SELECT * FROM test ORDER BY d LIMIT 1000, 1;
+----+-------------+-------+-------+---------------+------+---------+------+------+-------+
| id | select_type | table | type  | possible_keys | key  | key_len | ref  | rows | Extra |
+----+-------------+-------+-------+---------------+------+---------+------+------+-------+
|  1 | SIMPLE      | test  | index | NULL          | d    | 4       | NULL | 1001 |       |
+----+-------------+-------+-------+---------------+------+---------+------+------+-------+

mysql> EXPLAIN SELECT * FROM test ORDER BY d LIMIT 10000, 1;
+----+-------------+-------+------+---------------+------+---------+------+---------+----------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows    | Extra          |
+----+-------------+-------+------+---------------+------+---------+------+---------+----------------+
|  1 | SIMPLE      | test  | ALL  | NULL          | NULL | NULL    | NULL | 2000343 | Using filesort |
+----+-------------+-------+------+---------------+------+---------+------+---------+----------------+

MySQL does use the index for OFFSET 1000, but not for OFFSET 10000.

Even more interesting, if I FORCE INDEX, the query takes longer:

mysql> SELECT * FROM test FORCE INDEX(d) ORDER BY d LIMIT 1000000, 1;
+------------+--------+
| d          | i      |
+------------+--------+
| 1897-07-22 | 999980 |
+------------+--------+
1 row in set (2.21 sec)

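Answer 1: (score: 0)

I don't think StackOverflow needs to reach rows at offset 10000000. The query below should be fast enough if there is an index on date and the numbers in the LIMIT clause come from real-world examples, not millions :)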

SELECT * 
FROM stuff 
ORDER BY date DESC 
LIMIT {pageNumber}*{stuffPerPage}, {stuffPerPage}
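
To make the arithmetic concrete, here is a minimal sketch of that query driven from Python. It assumes an in-memory SQLite database standing in for MySQL, a made-up `stuff` table with an integer `date` column, and a hypothetical helper `fetch_page`; none of this is from the original setup. `LIMIT ? OFFSET ?` is the same thing as MySQL's `LIMIT offset, row_count`: the second number is the page size, not an end position.

```python
import sqlite3

def fetch_page(conn, page_number, per_page):
    """Return one page of rows, newest first; page_number is zero-based."""
    offset = page_number * per_page
    # Skip `offset` rows, then return at most `per_page` rows.
    return conn.execute(
        "SELECT id, date FROM stuff ORDER BY date DESC LIMIT ? OFFSET ?",
        (per_page, offset),
    ).fetchall()

# Hypothetical demo data: 100 rows whose date grows with the id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stuff (id INTEGER PRIMARY KEY, date INTEGER)")
conn.execute("CREATE INDEX stuff_date ON stuff(date)")
conn.executemany(
    "INSERT INTO stuff (id, date) VALUES (?, ?)",
    [(i, 1000 + i) for i in range(1, 101)],
)

page = fetch_page(conn, page_number=2, per_page=10)
page_ids = [row[0] for row in page]
```

Even with the index, the database still has to walk past every skipped row, so the cost grows with the offset; this only stays cheap for the small, realistic page numbers mentioned above.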

Update:

If records in the table are deleted relatively rarely (as on StackOverflow), then you can use the following solution:

SELECT * 
FROM stuff 
WHERE id between 
    {stuffCount}-{pageNumber}*{stuffPerPage}+1 AND 
    {stuffCount}-{pageNumber-1}*{stuffPerPage}
ORDER BY id DESC 

where {stuffCount} is:

SELECT MAX(id) FROM stuff

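If some records have been deleted from the database, then some pages will have fewer than {stuffPerPage} records, but that shouldn't be a problem. StackOverflow uses an inexact algorithm too. For example, go to the first page and then the last page, and you will see that both return 30 records. Mathematically, that is nonsense.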
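
The id-range idea above can be sketched as follows. This is only an illustration under assumptions: SQLite stands in for MySQL, the `stuff` table and the helpers `id_range_for_page` and `fetch_page_by_id` are hypothetical, pages are one-based as in the BETWEEN formula, and an empty table is not handled.

```python
import sqlite3

def id_range_for_page(stuff_count, page_number, per_page):
    """Inclusive (low, high) id bounds for a one-based page, newest first."""
    high = stuff_count - (page_number - 1) * per_page
    low = stuff_count - page_number * per_page + 1
    return low, high

def fetch_page_by_id(conn, page_number, per_page):
    # {stuffCount} is taken as MAX(id), as in the answer.
    stuff_count = conn.execute("SELECT MAX(id) FROM stuff").fetchone()[0]
    low, high = id_range_for_page(stuff_count, page_number, per_page)
    # A range lookup on the primary key; no rows are skipped one by one.
    return conn.execute(
        "SELECT id FROM stuff WHERE id BETWEEN ? AND ? ORDER BY id DESC",
        (low, high),
    ).fetchall()

# Hypothetical demo data: ids 1..100 with one deleted record.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stuff (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO stuff (id) VALUES (?)", [(i,) for i in range(1, 101)])
conn.execute("DELETE FROM stuff WHERE id = 95")

first_page = [row[0] for row in fetch_page_by_id(conn, 1, 10)]
second_page = [row[0] for row in fetch_page_by_id(conn, 2, 10)]
```

Here `first_page` comes back with nine ids because id 95 was deleted, which is exactly the "fewer than {stuffPerPage} records" effect described above, while `second_page` is a full page.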

Solutions designed for large databases often rely on hacks that ordinary users usually never notice.


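Nowadays, paginating through millions of records is not popular, because it is impractical. What is popular now is infinite scrolling (automatic, or manual via a button click). It makes more sense, and pages load faster because they don't need to be reloaded. If you think old records could also be useful to your users, then it is a good idea to create a page with random records (with infinite scrolling as well). That's my opinion :)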
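
The answer does not say how an infinite-scroll page would fetch its batches. One common approach, sketched here as an assumption rather than as what StackOverflow does, is to remember the last id the client has seen and ask for the rows just below it (often called keyset or cursor pagination). SQLite stands in for MySQL, and the `stuff` table and the `load_more` helper are hypothetical.

```python
import sqlite3

def load_more(conn, last_seen_id, batch_size):
    """Return the next batch of ids older than last_seen_id (None = newest)."""
    if last_seen_id is None:
        rows = conn.execute(
            "SELECT id FROM stuff ORDER BY id DESC LIMIT ?",
            (batch_size,),
        )
    else:
        # Seek straight to the position via the primary key; no OFFSET needed.
        rows = conn.execute(
            "SELECT id FROM stuff WHERE id < ? ORDER BY id DESC LIMIT ?",
            (last_seen_id, batch_size),
        )
    return [row[0] for row in rows]

# Hypothetical demo data: ids 1..25.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stuff (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO stuff (id) VALUES (?)", [(i,) for i in range(1, 26)])

first = load_more(conn, None, 10)
second = load_more(conn, first[-1], 10)
third = load_more(conn, second[-1], 10)
```

Each call does the same amount of work no matter how far the user has scrolled, and deleted rows never produce short batches. The trade-off is that the user cannot jump directly to an arbitrary page number.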