Question

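My machine learning script produces a lot of data (millions of `BTree`s contained in one root `BTree`) and stores it in ZODB's `FileStorage`, mainly because all of it would not fit in RAM. The script also frequently modifies previously added data.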

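When I increased the complexity of the problem, so that more data needs to be stored, I noticed performance issues: the script now computes data on average two to even ten times slower (the only thing that changed is the amount of data that is stored and later retrieved to be changed).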

I tried setting `cache_size` to various values between 1000 and 50000. To be honest, the differences in speed were negligible.

I thought of switching to `RelStorage`, but unfortunately the docs only mention how to configure frameworks such as Zope or Plone. I am using ZODB only.

I wonder whether `RelStorage` would be faster in my case.

Here is how I currently set up the ZODB connection:

```
import ZODB

connection = ZODB.connection('zodb.fs', ...)
dbroot = connection.root()
```

It is clear to me that ZODB is currently the bottleneck of my script. I am looking for advice on how to solve this problem.

I chose ZODB because I thought a NoSQL database would fit my case better, and I liked the idea of an interface similar to Python's `dict`.

Code and data structures:

Root data structures:
```
if not hasattr(dbroot, 'actions_values'):
    dbroot.actions_values = BTree()
if not hasattr(dbroot, 'games_played'):
    dbroot.games_played = 0
```
`actions_values` is conceptually built as follows:
```
actions_values = {  # BTree
    str(state): {  # BTree
        # contains actions (the column to pick, to be exact, as I'm working on an agent playing Connect 4)
        # and their values (only actions previously taken by the agent are present here), e.g.:
        1: 0.4356,
        5: 0.3456,
    },
    # other states
}
```
`state` is a simple 2D array representing the game board. The possible values of its fields are `1`, `2`, or `None`:
```
board = [[None] * cols for _ in range(rows)]
```
(in my case `rows = 6` and `cols = 7`)
Main loop:
```
should_play = 10000000
transactions_freq = 10000
packing_freq = 50000

player = ReinforcementPlayer(dbroot.actions_values, config)

while dbroot.games_played < should_play:
    # max_epsilon at start and then linearly drops to min_epsilon:
    epsilon = max_epsilon - (max_epsilon - min_epsilon) * dbroot.games_played / (should_play - 1)

    dbroot.games_played += 1
    sys.stdout.write('\rPlaying game %d of %d' % (dbroot.games_played, should_play))
    sys.stdout.flush()

    board_state = player.play_game(epsilon)

    if dbroot.games_played % transactions_freq == 0:
        print('Committing...')
        transaction.commit()
    if dbroot.games_played % packing_freq == 0:
        print('Packing DB...')
        connection.db().pack()
```
(Packing also takes a long time, but that is not the main problem; I could pack the database after the program finishes.)
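To make the epsilon schedule in the loop easier to check, here is a minimal standalone sketch of the same linear interpolation. The function name `linear_epsilon` and the numbers (`should_play = 11`, `max_epsilon = 1.0`, `min_epsilon = 0.1`) are made up for illustration; the real values live in the script's config, which is not shown here.

```python
def linear_epsilon(games_played, should_play, max_epsilon, min_epsilon):
    """Exploration rate for the game that starts after `games_played` games."""
    # float() keeps the division exact on Python 2 as well as Python 3.
    progress = games_played / float(should_play - 1)
    return max_epsilon - (max_epsilon - min_epsilon) * progress


# Hypothetical values, chosen only to keep the numbers easy to follow.
should_play = 11
max_epsilon = 1.0
min_epsilon = 0.1

schedule = [linear_epsilon(n, should_play, max_epsilon, min_epsilon)
            for n in range(should_play)]
first, middle, last = schedule[0], schedule[5], schedule[-1]
```

With these inputs the schedule starts at `max_epsilon`, passes through the midpoint halfway, and ends at (approximately, up to floating-point rounding) `min_epsilon` for the last game.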
Code operating on `dbroot` (inside `ReinforcementPlayer`):
```
def get_actions_with_values(self, player_id, state):
    if player_id == 1:
        lookup_state = state
    else:
        lookup_state = state.switch_players()
    lookup_state_str = str(lookup_state)
    if lookup_state_str in self.actions_values:
        return self.actions_values[lookup_state_str]
    mirror_lookup_state_str = str(lookup_state.mirror())
    if mirror_lookup_state_str in self.actions_values:
        return self.mirror_actions(self.actions_values[mirror_lookup_state_str])
    return None

def get_value_of_action(self, player_id, state, action, default=0):
    actions = self.get_actions_with_values(player_id, state)
    if actions is None:
        return default
    return actions.get(action, default)

def set_value_of_action(self, player_id, state, action, value):
    if player_id == 1:
        lookup_state = state
    else:
        lookup_state = state.switch_players()
    lookup_state_str = str(lookup_state)
    if lookup_state_str in self.actions_values:
        self.actions_values[lookup_state_str][action] = value
        return
    mirror_lookup_state_str = str(lookup_state.mirror())
    if mirror_lookup_state_str in self.actions_values:
        self.actions_values[mirror_lookup_state_str][self.mirror_action(action)] = value
        return
    self.actions_values[lookup_state_str] = BTree()
    self.actions_values[lookup_state_str][action] = value
```
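(Functions with `mirror` in their name simply reverse the columns (actions). This is done because Connect 4 boards that are vertical reflections of each other are equivalent.)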
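The mirror helpers themselves are not shown, so the following is only a sketch of what "reversing the columns" could look like. It assumes 0-based column indices and a plain list-of-lists board; the free functions `mirror_board`, `mirror_action` and `mirror_actions` are illustrative stand-ins, not the actual methods of the script.

```python
COLS = 7  # board width from the question


def mirror_board(board):
    """Return a copy of the board with every row reversed left to right."""
    return [list(reversed(row)) for row in board]


def mirror_action(action, cols=COLS):
    """Map a column index to its mirror image (assumes 0-based columns)."""
    return cols - 1 - action


def mirror_actions(actions, cols=COLS):
    """Mirror every action key of an {action: value} mapping."""
    return {mirror_action(a, cols): v for a, v in actions.items()}


row = [1, None, None, None, None, None, 2]
mirrored_row = mirror_board([row])[0]
mirrored_values = mirror_actions({1: 0.4356, 5: 0.3456})
```

Mirroring twice gives back the original board and the original mapping, which is why a single stored entry can serve both a position and its reflection.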

After 550000 games, `len(dbroot.actions_values)` is 6018450.

According to `iotop`, I/O operations take 90% of the time.

Answer 1

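Using any (other) database would probably not help, as they are subject to the same disk I/O and memory limitations as ZODB. If you manage to offload computations to the database engine itself (PostgreSQL plus SQL scripts) it might help, as the database engine would have more information to make intelligent choices about how to execute the code. But there is nothing magical here, and the same things can most likely be done quite easily with ZODB.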

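Some ideas for what can be done:

- Have indexes of the data instead of loading full objects (the equivalent of an SQL "full table scan"). Keep intelligently preprocessed copies of the data: indexes, sums, partials.
- Make the objects themselves smaller (Python classes have the `__slots__` trick).
- Use transactions in a smart way. Don't try to process all the data in a single big chunk.
- Parallel processing: use all CPU cores instead of a single-threaded approach.
- Don't use BTrees; maybe there is something more efficient for your use case.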
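As a small illustration of the `__slots__` idea from the list above, here is a sketch in plain Python. The classes `ActionValuePlain` and `ActionValueSlots` are invented for the example and are not part of the script in the question. It only shows the in-memory effect on ordinary objects; whether it also shrinks what ZODB writes to disk would need to be measured separately.

```python
class ActionValuePlain(object):
    """Ordinary class: every instance carries its own __dict__."""

    def __init__(self, action, value):
        self.action = action
        self.value = value


class ActionValueSlots(object):
    """Same data, but attributes live in fixed slots instead of a __dict__."""

    __slots__ = ('action', 'value')

    def __init__(self, action, value):
        self.action = action
        self.value = value


plain = ActionValuePlain(1, 0.4356)
slim = ActionValueSlots(1, 0.4356)

plain_has_dict = hasattr(plain, '__dict__')
slim_has_dict = hasattr(slim, '__dict__')

# A slotted instance rejects attributes that were not declared up front.
try:
    slim.extra = 'not allowed'
    extra_rejected = False
except AttributeError:
    extra_rejected = True
```

The trade-off is flexibility: a slotted class cannot grow new attributes at runtime, which is usually acceptable for small, fixed-shape records stored millions of times.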

Some code samples from the script, the actual RAM and Data.fs sizes, and so on would help in giving further ideas.

Answer 2

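Just to be clear here, which BTree class are you actually using? An OOBTree?

A few points about those BTrees:

1) Each BTree is composed of a number of Buckets. Each Bucket holds a certain number of items before being split. I can't remember how many items they currently hold, but I once tried tweaking the C code for them and recompiling to hold a larger number, as the value was chosen nearly two decades ago.

2) It is sometimes possible to construct very unbalanced BTrees. For example, if you add values in sorted order (such as a timestamp that only ever increases), you end up with a tree that is O(n) to search. A number of years ago the folks at Jarn wrote a script that could rebalance the BTrees in Zope's Catalog, which might be adaptable for you.

3) Rather than using an OOBTree, you can use an OOBucket instead. This will end up being just a single pickle in the ZODB, so it may become too big in your use case, but if you are doing all the writes in a single transaction then it may be faster (at the expense of having to rewrite the entire Bucket on every update).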

-Matt

How to improve the performance of a script operating on a large amount of data?
