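60 million entries, select entries from a certain month. How to optimize the database?

Time: 2011-03-27 18:07:23

Tags: mysql sql query-optimization

I have a database with 60 million entries.

Every entry contains:

  • ID
  • DataSourceID
  • Some data
  • DateTime

  1. I need to select the entries from a certain month. Each month contains approximately 2 million entries.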

     select * 
       from Entries 
      where time between "2010-04-01 00:00:00" and "2010-05-01 00:00:00"
    

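    (The query takes approx. 1.5 minutes.)

  2. I'd also like to select the data of a certain month from a given DataSourceID. (This takes approx. 20 seconds.)

  3. There are about 50-100 different DataSourceIDs.

    Is there a way to make this faster? What are my options? How can I optimize this database/query?


    Edit: There are approx. 60-100 inserts per second!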

4 answers:

Answer 0 (score: 7)

To get the entries for a particular month of a particular year faster, you need to index the time column:

CREATE INDEX idx_time ON ENTRIES(time) USING BTREE;

Additionally, use:

SELECT e.* 
  FROM ENTRIES e
 WHERE e.time BETWEEN '2010-04-01' AND DATE_SUB('2010-05-01', INTERVAL 1 SECOND)

...because BETWEEN is inclusive, so the query you posted would also return any entries dated exactly "2010-05-01 00:00:00".
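
An alternative that avoids the one-second adjustment is a half-open range, which also stays correct if the column ever stores fractional seconds. A minimal sketch, assuming the table and column names from the question (Entries, time):

```sql
-- Half-open range: includes 2010-04-01 00:00:00, excludes 2010-05-01 00:00:00.
SELECT e.*
  FROM Entries e
 WHERE e.time >= '2010-04-01'
   AND e.time <  '2010-05-01';
```

Both forms are plain range conditions on time, so either one can use an index on that column.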

"I'd also like to select the data of a certain month from a given DataSourceID"

You could add a separate index for the datasourceid column:

CREATE INDEX idx_datasourceid ON ENTRIES(datasourceid) USING BTREE;

...or set up a covering (composite) index that includes both columns:

CREATE INDEX idx_time_datasourceid ON ENTRIES(time, datasourceid) USING BTREE;

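A multi-column index requires that its leftmost column(s) be used in the query for the index to be used. In this example, putting time first works for both situations you mentioned: datasourceid does not have to appear in the query for the index to be used. However, you have to test your queries by looking at the EXPLAIN output to know exactly what works best for your data and the queries being run against it.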
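
To check which index MySQL actually picks, prefix the query with EXPLAIN. A sketch, assuming the two-column index on (time, datasourceid) above exists; the datasourceid value 42 is made up for illustration:

```sql
-- Month only: can use the leftmost column (time) of the two-column index.
EXPLAIN
SELECT e.*
  FROM ENTRIES e
 WHERE e.time >= '2010-04-01' AND e.time < '2010-05-01';

-- Month plus one data source.
EXPLAIN
SELECT e.*
  FROM ENTRIES e
 WHERE e.time >= '2010-04-01' AND e.time < '2010-05-01'
   AND e.datasourceid = 42;
```

In the output, the key column names the chosen index (NULL means no index was used), and rows is the optimizer's estimate of how many rows it will examine. For the second query it may be worth comparing against an index on (datasourceid, time), since MySQL can then seek to one data source and range-scan its month; treat that as something to measure, not a given.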

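That said, indexes slow down INSERT, UPDATE and DELETE statements. An index also provides little value if the column has few distinct values. For example, a boolean column is a bad choice for an index because its cardinality is low.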
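
One way to gauge whether a column is selective enough to index is to count its distinct values, or to look at the cardinality estimate MySQL keeps for each index. A sketch, using the table and column names from the question:

```sql
-- Exact number of distinct values (scans the data, so it can be slow on 60M rows).
SELECT COUNT(DISTINCT DataSourceID) AS distinct_sources,
       COUNT(*)                     AS total_rows
  FROM Entries;

-- Estimated cardinality of each existing index (see the Cardinality column).
SHOW INDEX FROM Entries;
```

With only 50-100 distinct DataSourceIDs over 60 million rows, an index on that column alone would still match several hundred thousand rows per value.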

Answer 1 (score: 6)

Take advantage of InnoDB clustered primary key indexes.

http://dev.mysql.com/doc/refman/5.0/en/innodb-index-types.html

The following will be extremely performant:

create table datasources
(
year_id smallint unsigned not null,
month_id tinyint unsigned not null,
datasource_id tinyint unsigned not null,
id int unsigned not null, -- needed for uniqueness
data int unsigned not null default 0,
primary key (year_id, month_id, datasource_id, id)
)
engine=innodb;

select * from datasources where year_id = 2011 and month_id between 1 and 3;

select * from datasources where year_id = 2011 and month_id = 4 and datasource_id = 100;

-- etc..
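
The schema above stores the year and month as separate columns instead of a datetime. If the existing rows live in a table like the question's Entries, they could be copied across roughly like this. This is a sketch only: the source column names (ID, DataSourceID, SomeData, time) are assumptions based on the question, not part of this answer:

```sql
-- Derive year_id and month_id from the original datetime column.
-- Assumes SomeData fits into "data int unsigned" and DataSourceID fits into a tinyint.
insert into datasources (year_id, month_id, datasource_id, id, data)
select year(e.time), month(e.time), e.DataSourceID, e.ID, e.SomeData
  from Entries e;
```

Note that this schema as written keeps no day or time of day, so if that is needed the original datetime column would have to be carried along as well. For 60 million rows it is probably safer to copy in batches (for example one month per statement) than in a single transaction.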

EDIT 2

I forgot that I ran the first test script with 3 months of data. Here are the results for a single month: 0.34 and 0.69 seconds.

select d.* from datasources d where d.year_id = 2010 and d.month_id = 3 and datasource_id = 100 order by d.id desc limit 10;
+---------+----------+---------------+---------+-------+
| year_id | month_id | datasource_id | id      | data  |
+---------+----------+---------------+---------+-------+
|    2010 |        3 |           100 | 3290330 | 38434 |
|    2010 |        3 |           100 | 3290329 |  9988 |
|    2010 |        3 |           100 | 3290328 | 25680 |
|    2010 |        3 |           100 | 3290327 | 17627 |
|    2010 |        3 |           100 | 3290326 | 64508 |
|    2010 |        3 |           100 | 3290325 | 14257 |
|    2010 |        3 |           100 | 3290324 | 45950 |
|    2010 |        3 |           100 | 3290323 | 49986 |
|    2010 |        3 |           100 | 3290322 |  2459 |
|    2010 |        3 |           100 | 3290321 | 52971 |
+---------+----------+---------------+---------+-------+
10 rows in set (0.34 sec)

select d.* from datasources d where d.year_id = 2010 and d.month_id = 3 order by d.id desc limit 10;
+---------+----------+---------------+---------+-------+
| year_id | month_id | datasource_id | id      | data  |
+---------+----------+---------------+---------+-------+
|    2010 |        3 |           116 | 3450346 | 42455 |
|    2010 |        3 |           116 | 3450345 | 64039 |
|    2010 |        3 |           116 | 3450344 | 27046 |
|    2010 |        3 |           116 | 3450343 | 23730 |
|    2010 |        3 |           116 | 3450342 | 52380 |
|    2010 |        3 |           116 | 3450341 | 35700 |
|    2010 |        3 |           116 | 3450340 | 20195 |
|    2010 |        3 |           116 | 3450339 | 21758 |
|    2010 |        3 |           116 | 3450338 | 51378 |
|    2010 |        3 |           116 | 3450337 | 34687 |
+---------+----------+---------------+---------+-------+
10 rows in set (0.69 sec)

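EDIT 1

I decided to test the above schema with approx. 60 million rows spread over 3 years. Each query is run cold, i.e. each is run separately, after which MySQL is restarted to clear any buffers, and with no query caching.

The full test script can be found here: http://pastie.org/1723506 or below...

As you can see, it's a pretty performant schema, even on my humble desktop :)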

select count(*) from datasources;
+----------+
| count(*) |
+----------+
| 60306030 |
+----------+

select count(*) from datasources where year_id = 2010;
+----------+
| count(*) |
+----------+
| 16691669 |
+----------+

select
 year_id, month_id, count(*) as counter
from
 datasources
where 
 year_id = 2010
group by
 year_id, month_id;
+---------+----------+---------+
| year_id | month_id | counter |
+---------+----------+---------+
|    2010 |        1 | 1080108 |
|    2010 |        2 | 1210121 |
|    2010 |        3 | 1160116 |
|    2010 |        4 | 1300130 |
|    2010 |        5 | 1860186 |
|    2010 |        6 | 1220122 |
|    2010 |        7 | 1250125 |
|    2010 |        8 | 1460146 |
|    2010 |        9 | 1730173 |
|    2010 |       10 | 1490149 |
|    2010 |       11 | 1570157 |
|    2010 |       12 | 1360136 |
+---------+----------+---------+
12 rows in set (5.92 sec)


select 
 count(*) as counter
from 
 datasources d
where 
 d.year_id = 2010 and d.month_id between 1 and 3 and datasource_id = 100;

+---------+
| counter |
+---------+
|   30003 |
+---------+
1 row in set (1.04 sec)

explain
select 
 d.* 
from 
 datasources d
where 
 d.year_id = 2010 and d.month_id between 1 and 3 and datasource_id = 100
order by
 d.id desc limit 10;

+----+-------------+-------+-------+---------------+---------+---------+------+---------+-----------------------------+
| id | select_type | table | type  | possible_keys | key     | key_len | ref  |rows    | Extra                       |
+----+-------------+-------+-------+---------------+---------+---------+------+---------+-----------------------------+
|  1 | SIMPLE      | d     | range | PRIMARY       | PRIMARY | 4       | NULL |4451372 | Using where; Using filesort |
+----+-------------+-------+-------+---------------+---------+---------+------+---------+-----------------------------+
1 row in set (0.00 sec)


select 
 d.* 
from 
 datasources d
where 
 d.year_id = 2010 and d.month_id between 1 and 3 and datasource_id = 100
order by
 d.id desc limit 10;

+---------+----------+---------------+---------+-------+
| year_id | month_id | datasource_id | id      | data  |
+---------+----------+---------------+---------+-------+
|    2010 |        3 |           100 | 3290330 | 38434 |
|    2010 |        3 |           100 | 3290329 |  9988 |
|    2010 |        3 |           100 | 3290328 | 25680 |
|    2010 |        3 |           100 | 3290327 | 17627 |
|    2010 |        3 |           100 | 3290326 | 64508 |
|    2010 |        3 |           100 | 3290325 | 14257 |
|    2010 |        3 |           100 | 3290324 | 45950 |
|    2010 |        3 |           100 | 3290323 | 49986 |
|    2010 |        3 |           100 | 3290322 |  2459 |
|    2010 |        3 |           100 | 3290321 | 52971 |
+---------+----------+---------------+---------+-------+
10 rows in set (0.98 sec)


select 
 count(*) as counter
from 
 datasources d
where 
 d.year_id = 2010 and d.month_id between 1 and 3;

+---------+
| counter |
+---------+
| 3450345 |
+---------+
1 row in set (1.64 sec)

explain
select 
 d.* 
from 
 datasources d
where 
 d.year_id = 2010 and d.month_id between 1 and 3
order by
 d.id desc limit 10;

+----+-------------+-------+-------+---------------+---------+---------+------+---------+-----------------------------+
| id | select_type | table | type  | possible_keys | key     | key_len | ref  |rows    | Extra                       |
+----+-------------+-------+-------+---------------+---------+---------+------+---------+-----------------------------+
|  1 | SIMPLE      | d     | range | PRIMARY       | PRIMARY | 3       | NULL |6566916 | Using where; Using filesort |
+----+-------------+-------+-------+---------------+---------+---------+------+---------+-----------------------------+
1 row in set (0.00 sec)


select 
 d.* 
from 
 datasources d
where 
 d.year_id = 2010 and d.month_id between 1 and 3
order by
 d.id desc limit 10;

+---------+----------+---------------+---------+-------+
| year_id | month_id | datasource_id | id      | data  |
+---------+----------+---------------+---------+-------+
|    2010 |        3 |           116 | 3450346 | 42455 |
|    2010 |        3 |           116 | 3450345 | 64039 |
|    2010 |        3 |           116 | 3450344 | 27046 |
|    2010 |        3 |           116 | 3450343 | 23730 |
|    2010 |        3 |           116 | 3450342 | 52380 |
|    2010 |        3 |           116 | 3450341 | 35700 |
|    2010 |        3 |           116 | 3450340 | 20195 |
|    2010 |        3 |           116 | 3450339 | 21758 |
|    2010 |        3 |           116 | 3450338 | 51378 |
|    2010 |        3 |           116 | 3450337 | 34687 |
+---------+----------+---------------+---------+-------+
10 rows in set (1.98 sec)

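Hope this helps :)

Answer 2 (score: 2)

You can use an index to trade disk usage for query speed. An index that starts with the time column will speed up queries that ask for a specific month: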

create index IX_YourTable_Date on YourTable (time, DataSourceID, ID, SomeData)

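Because the index starts with the time field, MySQL can do a key range scan on the index. That should be about as fast as it gets. The index should include all columns in the query, otherwise MySQL has to do a lookup from the index to the table data for each row. Since you're asking for 2 million rows, MySQL would probably ignore an index that is not covering. (A covering index = an index that contains all columns in the query.)

If you never query on ID, you could redefine the table to use (time, DataSourceID, ID) as the primary key: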
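
Whether a query is actually served from the index alone shows up in EXPLAIN: the Extra column contains "Using index" when MySQL does not need to read the table rows. A sketch, assuming the IX_YourTable_Date index above and the column names it lists:

```sql
-- Every selected column is part of the index, so the index can cover the query.
EXPLAIN
SELECT time, DataSourceID, ID, SomeData
  FROM YourTable
 WHERE time >= '2010-04-01' AND time < '2010-05-01';
```

Selecting a column that is not in the index (or using SELECT * on a wider table) would make the index non-covering again.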

alter table YourTable add primary key (time, DataSourceID, ID)

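This speeds up searches on time at no cost in disk space, but searches on ID would become very slow.

Answer 3 (score: 1)

I'd try putting an index on the time field, if you don't have one already.

For DataSourceID, you could try using an ENUM instead of varchar/int.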
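
If the table already has a primary key on ID, the old key can be swapped out in the same statement, and an AUTO_INCREMENT ID still needs an index of its own. A sketch under those assumptions (the index name IX_YourTable_ID is made up); on InnoDB this rebuilds the whole table, which will take a while at 60 million rows:

```sql
ALTER TABLE YourTable
  DROP PRIMARY KEY,                          -- remove the existing key on ID
  ADD PRIMARY KEY (time, DataSourceID, ID),  -- cluster the rows by time
  ADD KEY IX_YourTable_ID (ID);              -- keeps AUTO_INCREMENT valid and ID lookups indexed
```

The secondary key on ID gives up part of the disk-space saving in exchange for reasonably fast lookups by ID.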