Question

My code contains this while loop:

while A.shape[0] > 0:
    idx = A.score.values.argmax()
    one_center = A.coordinate.iloc[idx]
    # peak_centers and peak_scores are python lists
    peak_centers.append(one_center)
    peak_scores.append(A.score.iloc[idx])
    # exclude the coordinates around the selected peak
    A = A.loc[(A.coordinate <= one_center - exclusion) | (A.coordinate >= one_center + exclusion)]

A is a pandas DataFrame that looks like this:

   score  coordinate
0  0.158           1
1  0.167           2
2  0.175           3
3  0.183           4
4  0.190           5

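I am trying to find the maximum score (a peak) in A, then exclude some coordinates around that peak (a couple of hundred in this case), then find the next peak, and so on.

A here is a very large pandas DataFrame. Before running this while loop, the ipython session used 20% of the machine's memory. I expected the loop to only reduce memory consumption, since I am excluding data from the DataFrame. What I observe instead is that memory usage keeps increasing, and at some point the machine runs out of memory.

Is there anything I am missing here? Do I need to explicitly free memory somewhere?

Here is a short script that reproduces the behavior using random data: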

import numpy as np
import pandas as pd

A = pd.DataFrame({'score':np.random.random(132346018), 'coordinate':np.arange(1, 132346019)})
peak_centers = []
peak_scores = []
exclusion = 147
while A.shape[0] > 0:
    idx = A.score.values.argmax()
    one_center = A.coordinate.iloc[idx]
    # peak_centers and peak_scores are python lists
    peak_centers.append(one_center)
    peak_scores.append(A.score.iloc[idx])
    # exclude the coordinates around the selected peak
    A = A.loc[(A.coordinate <= one_center - exclusion) | (A.coordinate >= one_center + exclusion)]

# terminated the loop after memory consumption gets to 90% of machine memory
# but peak_centers and peak_scores are still short lists
print(len(peak_centers))
# output is 16

Answer 1

Use DataFrame.drop with inplace=True if you want to destructively mutate A without copying a large subset of A's data.

# rows strictly inside the exclusion window around the peak (the rows the question's loc filters out)
places_to_drop = ((A.coordinate - one_center).abs() < exclusion).values
A.drop(A.index[places_to_drop], inplace=True)
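
To see how the in-place drop fits into the whole loop, here is a minimal sketch on a small random frame. The function name find_peaks_inplace and the test sizes are made up for illustration. The rows dropped are those strictly inside the window, which matches the condition in the question's loop. Whether drop(inplace=True) really avoids an internal copy depends on the pandas version, so treat this as a starting point and measure memory on your own data.

```python
import numpy as np
import pandas as pd

def find_peaks_inplace(A, exclusion):
    """Greedy peak picking; note that A is mutated and ends up empty."""
    peak_centers, peak_scores = [], []
    while A.shape[0] > 0:
        idx = A.score.values.argmax()
        one_center = A.coordinate.iloc[idx]
        peak_centers.append(one_center)
        peak_scores.append(A.score.iloc[idx])
        # boolean mask of rows strictly inside the exclusion window
        in_window = ((A.coordinate - one_center).abs() < exclusion).values
        # drop those rows by label instead of rebinding A to a filtered frame
        A.drop(A.index[in_window], inplace=True)
    return peak_centers, peak_scores

# hypothetical small input, same column layout as in the question
rng = np.random.default_rng(0)
n = 10_000
A = pd.DataFrame({'score': rng.random(n), 'coordinate': np.arange(1, n + 1)})
centers, scores = find_peaks_inplace(A, exclusion=147)
print(len(centers))
```

The selected peak always lies inside its own window, so every iteration removes at least one row and the loop terminates.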

The original use of loc ultimately bottoms out in the _NDFrameIndexer method _getitem_iterable. _LocIndexer is a subclass of _NDFrameIndexer, and an instance of _LocIndexer is created to populate the loc property of a DataFrame.

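In particular, _getitem_iterable checks for a boolean index, which is what happens in your case. It then creates a new array holding the positions of the True entries (which wastes memory when key is already in boolean format):

inds, = key.nonzero()

and finally returns the "true" positions in a copy:

return self.obj.take(inds, axis=axis, convert=False)

In this code, key is your boolean index (that is, the result of the expression (A.coordinate <= one_center - exclusion) | (A.coordinate >= one_center + exclusion)), and self.obj is the parent DataFrame instance on which loc was called, so obj here is simply A.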
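
The two steps described above can be reproduced by hand on the five-row example from the question. This is only a sketch of the behavior, not the pandas source: it calls the public DataFrame.take without the convert argument (which newer pandas versions no longer accept), and the values of one_center and exclusion are picked arbitrarily.

```python
import numpy as np
import pandas as pd

A = pd.DataFrame({'score': [0.158, 0.167, 0.175, 0.183, 0.190],
                  'coordinate': [1, 2, 3, 4, 5]})
one_center, exclusion = 3, 1  # arbitrary illustration values

# boolean key, built the same way as in the question's loop
key = ((A.coordinate <= one_center - exclusion) |
       (A.coordinate >= one_center + exclusion)).values

inds, = key.nonzero()             # integer positions of the True entries
via_take = A.take(inds, axis=0)   # new object holding only those rows
via_loc = A.loc[key]              # what the question's loop does

same_rows = via_take.equals(via_loc)
# check whether the filtered result reuses A's buffer for the score column
shares = np.shares_memory(A.score.values, via_loc.score.values)
print(same_rows, shares)
```

If shares comes out False, the filtered frame owns its own data, so the old and the new frame exist side by side until the old one is released.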

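The documentation for DataFrame.take explains that the default behavior is to make a copy. In the current implementation of the indexers, there is no mechanism that lets you pass a keyword argument through to eventually perform take without copying.

On any reasonably modern machine, using the drop method should pose no problem for the size of data you describe, so the size of A is not to blame.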
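
For a sense of scale, here is a rough back-of-the-envelope estimate of the arrays involved in one loc-based iteration, using the row count from the question's script. It assumes 8-byte float64 and int64 columns and 8-byte integer positions, and it ignores the index and any pandas overhead, so the figures are approximations only.

```python
import numpy as np

n_rows = 132_346_018  # row count from the question's script

# assumption: both columns use 8-byte dtypes (float64 score, int64 coordinate)
value_bytes = np.dtype('float64').itemsize

frame_bytes = 2 * n_rows * value_bytes                  # score + coordinate
mask_bytes = n_rows * np.dtype('bool').itemsize         # one boolean mask
positions_bytes = n_rows * np.dtype('int64').itemsize   # upper bound for key.nonzero()

GiB = 1024 ** 3
for name, size in [('frame data', frame_bytes),
                   ('one boolean mask', mask_bytes),
                   ('integer positions', positions_bytes)]:
    print(f'{name}: {size / GiB:.2f} GiB')
```

Because only a few hundred rows are excluded per iteration, the copy returned by take is almost as large as the frame itself.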
