Question

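Assume a key/value table with at least tens of millions of rows.
Define an operation that takes a large number of IDs (again, tens of millions), looks up the corresponding values, and sums them.

With a database, this operation looks like it could approach (disk seek time) * (number of lookups).

With a flat file that is read in its entirety, this operation would approach (file size)/(drive transfer rate).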

Plugging in some (rough) values (from Wikipedia and/or experimentation):
seek time = 0.5ms
transfer rate = 64MByte/s
file size = 800M (for 70 million int/double key/values)
65 million value lookups

Database time = 0.5ms * 65000000 = 32500s = 9 hours
Flat file = 800M/(64MB/s) = 12.5s
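
The two estimates above can be reproduced with a few lines of Java. This is only a sketch of the arithmetic under the rough figures quoted in the question; the class and method names are made up for illustration, and neither model accounts for caching or read-ahead.

```java
// Back-of-envelope cost models using the rough figures from the question.
final class LookupCostEstimate {

    // Seek-bound model: every lookup is assumed to pay one full disk seek.
    static double seekBoundSeconds(double seekTimeMs, long lookups) {
        return seekTimeMs / 1000.0 * lookups;
    }

    // Scan-bound model: the whole file is read once at the sequential transfer rate.
    static double scanBoundSeconds(double fileSizeMB, double transferRateMBPerSec) {
        return fileSizeMB / transferRateMBPerSec;
    }

    public static void main(String[] args) {
        double db = seekBoundSeconds(0.5, 65_000_000L);   // 0.5 ms seek, 65 million lookups
        double flat = scanBoundSeconds(800.0, 64.0);      // 800 MB file, 64 MB/s
        System.out.printf("database  ~ %.0f s (%.1f h)%n", db, db / 3600.0);
        System.out.printf("flat file ~ %.1f s%n", flat);
    }
}
```

The gap between the two numbers comes entirely from the assumption that each lookup costs a seek, which is the worst case for the database.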

The experimental results are not that bad for MySQL, but the flat file still wins.

Experiment:
Create InnoDB and MyISAM id/value pair tables, e.g.

CREATE TABLE `ivi` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `val` double DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB

Fill them with 32 million rows of data of your choice. Query:

select sum(val) from ivm where id not in (1,12,121,1121);  //be sure to change the numbers each time or clear the query cache

Create and read a key/value flat file from Java using the following code.

 private static void writeData() throws IOException {
        long t = -System.currentTimeMillis();
        File dat = new File("/home/mark/dat2");
        if (dat.exists()){
            dat.delete();
        }
        FileOutputStream fos = new FileOutputStream(dat);
        ObjectOutputStream os = new ObjectOutputStream(new BufferedOutputStream(fos));
        for (int i=0; i< 32000000; i++){
            os.writeInt(i);
            os.writeDouble(i / 2.0);
        }
        os.flush();
        os.close();        
        t += System.currentTimeMillis();
        System.out.println("time ms = " + t);
    }
    private static void performSummationQuery() throws IOException{
        long t = -System.currentTimeMillis();

        File dat = new File("/home/mark/dat2");
        FileInputStream fin = new FileInputStream(dat);
        ObjectInputStream in = new ObjectInputStream(new BufferedInputStream(fin));
        HashSet<Integer> set = new HashSet<Integer>(Arrays.asList(11, 101, 1001, 10001, 100001));
        int i;
        double d;
        double sum = 0;
        try {
            while (true){
                i = in.readInt();
                d = in.readDouble();
                if (!set.contains(i)){
                    sum += d;
                }
            }
        } catch (EOFException e) {
            // End of file reached: readInt() signals EOF by throwing, so the loop ends here.
        }

        System.out.println("sum = " + sum);
        t += System.currentTimeMillis();
        System.out.println("time ms = " + t);
    }

Results:


InnoDB        8.0-8.1s            
MyISAM        3.1-16.5s
Stored proc   80-90s
FlatFile      1.6-2.4s (even after: echo 3 > /proc/sys/vm/drop_caches)

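My experiments show that the flat file beats the database here. Unfortunately, I also need to run "standard" CRUD operations on this table. But this is the usage pattern that is killing me.

So what is the best way to have MySQL behave like itself most of the time, yet beat a flat file in the scenario above?

EDIT
To clarify some points:
1. I have dozens of such tables, some will have hundreds of millions of rows, and I cannot keep them all in RAM. The case I have described is the one I need to support. The values associated with the IDs may change, and the selection of IDs is ad hoc. So there is no way to pre-generate and cache any sums. I need to do the work of "find each value and add them all up" every time.

Thanks.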

Answer 1

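Your numbers assume that MySQL will perform disk I/O 100% of the time, which rarely happens in practice. If your MySQL server has enough RAM and your table is properly indexed, your cache hit rate will quickly approach 100%, and MySQL will do very little disk I/O as a direct result of your sum operation. If you frequently need to run calculations over 10,000,000 rows, you may also consider adjusting your schema to reflect real-world usage (keeping a "cached" sum on hand is not always a bad idea, depending on your specific needs).

I strongly suggest that you put together a test database, throw in 100,000 test rows, and run some real queries in MySQL to determine how the system performs. Spending 15 minutes on this will give you far more accurate information.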

Answer 2

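Telling MySQL to ignore the primary (and only) index speeds up the query.

For InnoDB it saves about a second on the query. On MyISAM it keeps the query time consistently at the minimum time observed.

The change is to add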

ignore index(`PRIMARY`)

after the table name in the query.
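
To make the placement of the hint concrete, here is a small sketch that assembles the query text in Java. The class name and the `build` helper are hypothetical, and the table name `ivm` is taken from the question; only the position of `ignore index(...)` directly after the table name is the point being shown.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: builds the summation query with the index hint after the table name.
final class IgnoreIndexQuery {

    static String build(String table, List<Integer> excludedIds) {
        // Render the ids as a comma-separated list for the NOT IN clause.
        String ids = excludedIds.stream()
                .map(String::valueOf)
                .collect(Collectors.joining(","));
        return "select sum(val) from " + table
                + " ignore index(`PRIMARY`)"
                + " where id not in (" + ids + ")";
    }

    public static void main(String[] args) {
        System.out.println(build("ivm", Arrays.asList(1, 12, 121, 1121)));
    }
}
```

For real use the ids would normally be bound as parameters rather than concatenated; string building is used here only to show the resulting SQL.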

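EDIT
I appreciate all the input, but most of it has been of the form "you shouldn't do this", "do something completely different", and so on. None of it addresses the question at hand:

"So what is the best way to have MySQL behave like itself most of the time, yet beat a flat file in the scenario above?"

So far, the solution I have posted is this: use MyISAM and ignore the index. For this use case it seems to come closest to flat-file performance, while still giving me a database when I need one.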

Answer 3

I would use a summary table maintained by triggers, which gives sub-second performance, something like the following:

select
 st.tot - v.val 
from
 ivi_sum_total st
join
(
 select sum(val) as val from ivi where id in (1,12,121,1121)
) v;

+---------------------+
| st.tot - v.val      |
+---------------------+
| 1048317638720.78064 |
+---------------------+
1 row in set (0.07 sec)

Full schema

drop table if exists ivi_sum_total;
create table ivi_sum_total
(
tot decimal(65,5) default 0
) 
engine=innodb;

drop table if exists ivi;
create table ivi 
(
id int unsigned not null auto_increment,
val decimal(65,5) default 0,
primary key (id, val)
) 
engine=innodb;

delimiter #

create trigger ivi_before_ins_trig before insert on ivi
for each row
begin
  update ivi_sum_total set tot = tot + new.val;
end#

create trigger ivi_before_upd_trig before update on ivi
for each row
begin
  update ivi_sum_total set tot = (tot - old.val) + new.val;
end#

-- etc...

Testing

select count(*) from ivi;

+----------+
| count(*) |
+----------+
| 32000000 |
+----------+

select
 st.tot - v.val 
from
 ivi_sum_total st
join
(
 select sum(val) as val from ivi where id in (1,12,121,1121)
) v;

+---------------------+
| st.tot - v.val      |
+---------------------+
| 1048317638720.78064 |
+---------------------+
1 row in set (0.07 sec)

select sum(val) from ivi where id not in (1,12,121,1121);

+---------------------+
| sum(val)            |
+---------------------+
| 1048317638720.78064 |
+---------------------+
1 row in set (29.89 sec)

select * from ivi_sum_total;

+---------------------+
| tot                 |
+---------------------+
| 1048317683047.43227 |
+---------------------+
1 row in set (0.03 sec)


select * from ivi where id = 2;

+----+-------------+
| id | val         |
+----+-------------+
|  2 | 11781.30443 |
+----+-------------+
1 row in set (0.01 sec)

start transaction;
update ivi set val = 0 where id = 2;
commit;

Query OK, 1 row affected (0.01 sec)
Rows matched: 1  Changed: 1  Warnings: 0

select * from ivi where id = 2;

+----+---------+
| id | val     |
+----+---------+
|  2 | 0.00000 |
+----+---------+
1 row in set (0.00 sec)


select * from ivi_sum_total;

+---------------------+
| tot                 |
+---------------------+
| 1048317671266.12784 |
+---------------------+
1 row in set (0.00 sec)


select
 st.tot - v.val 
from
 ivi_sum_total st
join
(
 select sum(val) as val from ivi where id in (1,12,121,1121)
) v;

+---------------------+
| st.tot - v.val      |
+---------------------+
| 1048317626939.47621 |
+---------------------+
1 row in set (0.01 sec)

select sum(val) from ivi where id not in (1,12,121,1121);

+---------------------+
| sum(val)            |
+---------------------+
| 1048317626939.47621 |
+---------------------+
1 row in set (31.07 sec)
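
The idea behind the fast query can be sketched outside the database: keep a running total that is adjusted on every write, and answer "sum of everything except these ids" as the total minus the few excluded rows. The class below is a hypothetical in-memory stand-in for the `ivi` table and the trigger-maintained `ivi_sum_total` row, not the MySQL implementation itself.

```java
import java.math.BigDecimal;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;

// Hypothetical in-memory model of a table with a trigger-maintained running total.
final class RunningTotalTable {
    private final Map<Integer, BigDecimal> rows = new HashMap<>();
    private BigDecimal total = BigDecimal.ZERO;

    // Insert or update a row, adjusting the total the way the insert/update triggers do.
    void put(int id, BigDecimal val) {
        BigDecimal old = rows.put(id, val);
        if (old != null) {
            total = total.subtract(old);   // update: remove the old contribution first
        }
        total = total.add(val);
    }

    // Sum of all values whose id is NOT in excludedIds: total minus the excluded rows.
    BigDecimal sumExcluding(Collection<Integer> excludedIds) {
        BigDecimal excluded = BigDecimal.ZERO;
        for (Integer id : new HashSet<>(excludedIds)) {   // de-duplicate the ids
            BigDecimal v = rows.get(id);
            if (v != null) {
                excluded = excluded.add(v);
            }
        }
        return total.subtract(excluded);
    }

    public static void main(String[] args) {
        RunningTotalTable t = new RunningTotalTable();
        t.put(1, new BigDecimal("10.5"));
        t.put(2, new BigDecimal("4.25"));
        t.put(3, new BigDecimal("1"));
        System.out.println(t.sumExcluding(java.util.Arrays.asList(2)));
    }
}
```

The cost of `sumExcluding` grows with the number of excluded ids rather than with the table size, so the approach pays off only while the excluded set stays small relative to the table.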

Answer 4

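As far as I can see, you are comparing apples and oranges. MySQL (or any other relational database) is not meant to work on data that requires disk I/O all the time. If you do that, you defeat the purpose of the index. Worse, the index becomes a burden because it does not fit in RAM at all. That is why people use sharding and summary tables. In your example, the size of the database (and therefore the disk I/O) will be far larger than the flat file, because there is a primary index on top of the data itself. As z5h stated, ignoring the primary index can save some time, but it will never be as fast as a plain text file.

I suggest you use summary tables: for example, have a background job do the summarization, then UNION this summary table with the rest of the "live" table. But even MySQL does not handle fast-growing data well, and after several hundred million rows it will start to fail. That is why people work on distributed systems such as HDFS and map/reduce frameworks such as Hadoop.

P.S.: My technical examples are not 100% accurate; I just want to get the concepts across.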

Answer 5

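There is one option nobody has considered so far...

Since the Java code above uses a HashSet, why not use a hash index?

By default, indexes in MyISAM tables are BTREE indexes. By default, indexes in MEMORY tables are HASH indexes.

Just force the MyISAM table to use a HASH index instead of BTREE: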

CREATE TABLE `ivi`
(
`id` int(11) NOT NULL AUTO_INCREMENT,
`val` double DEFAULT NULL,
PRIMARY KEY (`id`) USING HASH
) ENGINE=MyISAM;

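That should level the playing field. However, index range searches perform poorly with a hash index. If you retrieve one ID at a time, it should be faster than your previous MyISAM tests.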
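
The trade-off between hash and ordered indexes can be illustrated with Java collections. This is an analogy only, with hypothetical names: `HashMap` plays the role of a hash index and `TreeMap` the role of an ordered (BTREE-like) index; it says nothing about how a particular MySQL storage engine implements either.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Analogy: HashMap ~ hash index (point lookups), TreeMap ~ ordered index (range scans).
final class HashVsOrderedIndex {

    // Point lookups by id work with either kind of map.
    static double sumByIds(Map<Integer, Double> index, int[] ids) {
        double sum = 0;
        for (int id : ids) {
            Double v = index.get(id);
            if (v != null) {
                sum += v;
            }
        }
        return sum;
    }

    // A range query needs key order, which only the ordered map provides directly.
    static double sumRange(TreeMap<Integer, Double> index, int from, int to) {
        double sum = 0;
        for (double v : index.subMap(from, true, to, true).values()) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<Integer, Double> hash = new HashMap<>();
        TreeMap<Integer, Double> ordered = new TreeMap<>();
        for (int i = 0; i < 10; i++) {       // same id -> i / 2.0 data as in the question
            hash.put(i, i / 2.0);
            ordered.put(i, i / 2.0);
        }
        System.out.println(sumByIds(hash, new int[] {2, 4}));
        System.out.println(sumRange(ordered, 2, 4));
    }
}
```

With the hash map, a range such as ids 2 through 4 can only be answered by probing every key in the range or scanning everything, which is the weakness the answer refers to.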

If you want to load the data faster:

Get rid of the AUTO_INCREMENT attribute
Get rid of the primary key
Use a regular index

CREATE TABLE `ivi`
(
`id` int(11) NOT NULL,
`val` double DEFAULT NULL,
KEY id (`id`) USING HASH
) ENGINE=MyISAM;

Then do something like this:

ALTER TABLE ivi DISABLE KEYS;
...
... (load the data and generate the ids manually)
...
ALTER TABLE ivi ENABLE KEYS;

This builds the index after the load has finished.

You may also want to increase key_buffer_size in /etc/my.cnf to handle a large number of MyISAM keys.

Give it a try and let us know whether it helped and what you found!

Answer 6

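Is this a single-user system?

Flat-file performance will degrade significantly with multiple users. A database "should" schedule disk reads so that queries running in parallel are all served.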

Answer 7

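You may want to look at NDBAPI. I believe those folks were able to reach speeds close to working with a file while still keeping the data stored in InnoDB.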

How can I make MySQL as fast as a flat file in this scenario?
