Continuing from this question, I have implemented two functions that do the same thing, one using reindexing and one without. The functions differ at line 3:
def update(centroid):
    best_mean_dist = 200
    clust_members = members_by_centeriod[centroid]
    for member in clust_members:
        member_mean_dist = 100 - df.ix[member].ix[clust_members].score.mean()
        if member_mean_dist < best_mean_dist:
            best_mean_dist = member_mean_dist
            centroid = member
    return centroid, best_mean_dist
def update1(centroid):
    best_mean_dist = 200
    members_in_clust = members_by_centeriod[centroid]
    new_df = df.reindex(members_in_clust, level=0).reindex(members_in_clust, level=1)
    for member in members_in_clust:
        member_mean_dist = 100 - new_df.ix[member].ix[members_in_clust].score.mean()
        if member_mean_dist < best_mean_dist:
            best_mean_dist = member_mean_dist
            centroid = member
    return centroid, best_mean_dist
These functions are called from an IPython notebook cell:
centroids = [update(centroid) for centroid in centroids]
The dataframe `df` is a large one, with roughly 4 million rows, taking about 300MB of memory.
The `update1` function, which uses reindexing, is much faster. But something unexpected happens: after only a few iterations of the reindexing version, memory usage quickly climbs from about 300MB to 1.5GB, and then I hit a memory violation. The `update` function does not suffer from this behavior. There are two things I don't get:
1. Reindexing makes a copy, that much is obvious. But isn't that copy supposed to be dead every time the `update1` function finishes? The `new_df` variable should go away together with the function that created it, right?
2. Even if the garbage collector doesn't kill `new_df` immediately, once memory runs low it should kill it rather than raise an OutOfMemory exception, right?
I tried manually adding `del new_df` at the end of the `update1` function to kill the frame, and it didn't help. So does that suggest the bug is actually in the reindexing process itself?
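That intuition about function locals can be checked directly. A minimal sketch (assuming CPython, where reference counting frees an object as soon as its last reference disappears; `make_frame` is just a hypothetical stand-in for `update1`):

import weakref
import pandas as pd

def make_frame():
    local_df = pd.DataFrame({"score": range(3)})
    # return only a weak reference, which does not keep local_df alive
    return weakref.ref(local_df)

ref = make_frame()
print(ref())  # None: the local frame was freed when make_frame returned

So the local really does die with the function, which suggests the retained memory is held by something other than `new_df` itself.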
Edit
I found the problem, but I don't understand the reason for this behavior: the Python garbage collector refuses to clean up the reindexed dataframe. This works:
for i in range(2000):
    new_df = df.reindex(clust_members, level=0).reindex(clust_members, level=1)
This also works:
def reindex():
    new_df = df.reindex(clust_members, level=0).reindex(clust_members, level=1)
    score = 100 - new_df.ix[member].ix[clust_members].score.mean()
    return score

for i in range(2000):
    reindex()
But this keeps the reindexed objects alive in memory:
z = []
for i in range(2000):
    z.append(reindex())
I think my usage is naively correct. How does the `new_df` variable stay associated with the score value, and why?
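One detail worth probing (my own check, not part of the original question): the value `reindex()` returns is a plain scalar, so the list `z` holds no direct reference back to the reindexed frames. A minimal sketch of that check, using a toy frame in place of `new_df`:

import pandas as pd

df = pd.DataFrame({"score": [1.0, 2.0, 3.0]})
score = 100 - df.score.mean()  # same shape of expression that reindex() returns
print(type(score))             # a bare float/numpy.float64, no reference to df

That makes the retention more surprising still, and points at the memory being held somewhere less obvious, which the answer below digs into.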
Answer 0 (score: 0):
Here is my debugging code. When you do the indexing, the Index object creates a `_tuples` cache and engine maps, and I think these two cached objects are what use the memory. If I add the lines marked with `****`, the memory increase is very small, about 6MB on my PC:
import pandas as pd
print pd.__version__
import numpy as np
import psutil
import os
import gc

def get_memory():
    # resident set size of this process, in bytes
    pid = os.getpid()
    p = psutil.Process(pid)
    return p.get_memory_info().rss

def get_object_ids():
    # snapshot of the ids of every object the gc currently tracks
    return set(id(obj) for obj in gc.get_objects())

m1 = get_memory()

# build a 4-million-row DataFrame with a 2000x2000 MultiIndex
n = 2000
iy, ix = np.indices((n, n))
index = pd.MultiIndex.from_arrays([iy.ravel(), ix.ravel()])
values = np.random.rand(n*n, 3)
df = pd.DataFrame(values, index=index, columns=["a", "b", "c"])

# random label subsets to reindex with (ix/iy are reused on purpose)
ix = np.unique(np.random.randint(0, n, 500))
iy = np.unique(np.random.randint(0, n, 500))

m2 = get_memory()
objs1 = get_object_ids()

z = []
for i in range(5):
    df2 = df.reindex(ix, level=0).reindex(iy, level=1)
    z.append(df2.mean().mean())
    df.index._tuples = None  # ****
    df.index._cleanup()      # ****
    del df2
    gc.collect()             # ****

m3 = get_memory()
print (m2-m1)/1e6, (m3-m2)/1e6  # memory growth in MB: building df, then the loop

# count the object types created since the snapshot
from collections import Counter
counter = Counter()
for obj in gc.get_objects():
    if id(obj) not in objs1:
        typename = type(obj).__name__
        counter[typename] += 1
print counter
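A follow-up note on the `****` workaround: `_tuples` and `_cleanup()` are private pandas internals from that era, not a stable API, so treat them as a diagnostic rather than something to rely on. The underlying point is that a MultiIndex lazily builds and caches its tuple representation and hash-table engines on first lookup, and those caches live as long as `df.index` does, no matter how many reindexed copies you delete. A hypothetical helper wrapping the same idea:

def reindex_and_score(df, members):
    # reindex on both levels, compute a score, then drop the MultiIndex
    # caches so they cannot accumulate across calls (pandas internals!)
    new_df = df.reindex(members, level=0).reindex(members, level=1)
    score = new_df.mean().mean()
    df.index._tuples = None
    df.index._cleanup()
    return score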