Question

I have a table in MySQL like this:

CREATE TABLE IF NOT EXISTS `connections` (
  `src` int(10) unsigned NOT NULL,
  `sport` smallint(5) unsigned NOT NULL,
  `dst` int(10) unsigned NOT NULL,
  `dport` smallint(5) unsigned NOT NULL,
  `time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`src`,`sport`,`dst`,`dport`,`time`),
  KEY `time` (`time`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

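About 2.5 million records are inserted into this table every day.

When I select the records for a period of time, such as one day, it takes about 7 minutes. How can I improve this?

I am using Ruby on Rails 4.0.0.

My select looks like this: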

connections = Connection.select('src, dst, UNIX_TIMESTAMP(time) as time')
                  .where(time: timeFrom..timeTo)
                  .order('time ASC')

After selecting from the database, I have a loop like this:

connections.each do |con|

        link = getServerID(con['src'])
        link = getServerID(con['dst']) if link == 0

        @total[link].append [con['time'] * 1000, con['dst']]
end

In this loop I do some processing on src and dst and then add the result to a hash. During this part my computer crashes.

Answer 1

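First, you should try running the SQL query directly against the database, without Rails. That helps identify the bottleneck: is the query itself slow, or is Rails slow? I suspect the SQL part is not the problem, but double-check it first.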

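My guess is that your biggest problem is connections.each. It loads all matching rows into your application and creates an ActiveRecord model for each of them. Let's do some math: 2.5M entries * 1KB (just a guess, probably more) results in 2.5GB of data loaded into memory. You may see an improvement by using connections.find_each, because it loads the connections in smaller batches.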
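The estimate above can be written out as a quick calculation. This is only a back-of-the-envelope sketch: the 1 KB per row figure is the guess from this answer, not a measurement, and the batch size of 1000 is the documented default of find_each.

```ruby
# Rough memory estimate for loading one day of rows at once.
ROWS_PER_DAY  = 2_500_000
BYTES_PER_ROW = 1_000   # assumption: the answer's "1KB" guess per ActiveRecord object
BATCH_SIZE    = 1_000   # default batch size of find_each

# Everything in memory at once (connections.each)
total_gb = ROWS_PER_DAY * BYTES_PER_ROW / 1_000_000_000.0

# Only one batch of records in memory at a time (find_each)
batch_mb = BATCH_SIZE * BYTES_PER_ROW / 1_000_000.0

puts "all rows: #{total_gb} GB, one batch: #{batch_mb} MB"
```

Note that batching only limits the memory used by the loaded records. The @total hash in the question still grows with every row appended to it.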

What does the getServerID method do? It is called up to 5M times.
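If getServerID always returns the same result for the same address, and the number of distinct addresses is much smaller than the number of rows, caching its results may cut most of those calls. The following is a minimal sketch under those assumptions. The question does not show getServerID, so slow_server_id and the KNOWN_SERVERS data are hypothetical stand-ins.

```ruby
# Hypothetical data: maps an IPv4 address (as an integer) to a server id.
KNOWN_SERVERS = { 167_772_161 => 1, 167_772_162 => 2 }.freeze

# Counts how often the slow lookup really runs, for illustration only.
LOOKUP_CALLS = Hash.new(0)

# Hypothetical stand-in for the question's getServerID; returns 0 when unknown.
def slow_server_id(ip)
  LOOKUP_CALLS[ip] += 1
  KNOWN_SERVERS.fetch(ip, 0)
end

# A Hash with a default block runs the slow lookup once per distinct address
# and stores the result, so repeated addresses are served from memory.
SERVER_ID_CACHE = Hash.new { |cache, ip| cache[ip] = slow_server_id(ip) }

def cached_server_id(ip)
  SERVER_ID_CACHE[ip]
end
```

In the loop from the question, cached_server_id(con['src']) would take the place of getServerID(con['src']). The cache itself uses memory proportional to the number of distinct addresses seen.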

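I am fairly sure you cannot improve this code much as it stands. It looks like a case of the wrong database for the problem, or the wrong algorithm. Since you are unlikely to want to display 2.5M records on a website, it would be better to tell us what you want to achieve.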

Answer 2

You could try table partitioning:

http://dev.mysql.com/doc/refman/5.1/en/partitioning.html

There is also a nice slide deck:

http://www.slideshare.net/datacharmer/mysql-partitions-tutorial

Answer 3

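As already mentioned, fetching 2.5 million entries takes a lot of memory and CPU power. Try fetching the records in batches.

Rails has batch support built in: http://api.rubyonrails.org/classes/ActiveRecord/Batches.html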

connections.find_each do |con|
    link = getServerID(con['src'])
    link = getServerID(con['dst']) if link == 0

    @total[link].append [con['time'] * 1000, con['dst']]
end

If this does not solve your problem, you should consider finding a better approach that does not loop over so many records every time.
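One possible way to avoid pulling a whole day in a single query is to split the requested time range into smaller windows and query each window in turn, which can use the existing index on time. This is a sketch of that idea, not something proposed in the answers above. The helper name time_windows and the one-hour step are assumptions chosen for illustration.

```ruby
# Split the interval from `from` up to (but not including) `to` into
# consecutive half-open windows of at most `step` seconds each.
def time_windows(from, to, step)
  windows = []
  start = from
  while start < to
    finish = [start + step, to].min
    windows << (start...finish)   # exclusive end, so windows never overlap
    start = finish
  end
  windows
end

# Hypothetical use with the query from the question (not run here):
#   time_windows(timeFrom, timeTo, 3600).each do |window|
#     Connection.select('src, dst, UNIX_TIMESTAMP(time) as time')
#               .where(time: window).order('time ASC').each { |con| ... }
#   end
# Because the windows are half-open, rows stamped exactly at timeTo are
# not included, unlike the inclusive timeFrom..timeTo range in the question.
```

Each window then holds roughly 1/24 of a day's rows, so only that share is loaded into memory at a time.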

How to handle millions of records per day with MySQL
