The goal is to compute the RMSE between two sets of columns in a pandas DataFrame. The problem is that the actual memory used is almost 10x the size of the DataFrame. Here is the code I'm using to compute the RMSE:
import pandas as pd
import numpy as np
from random import shuffle
# set up test df (actual data is a pre-computed DF stored in HDF5)
dim_x, dim_y = 50, 1000000 # actual dataset dim_y = 56410949
cols = ["a_"+str(i) for i in range(1,(dim_x//2)+1)]
cols_b = ["b_"+str(i) for i in range(1,(dim_x//2)+1)]
cols.extend(cols_b)
df = pd.DataFrame(np.random.uniform(0,10,[dim_y, dim_x]), columns=cols)
# calculate rmse : https://stackoverflow.com/a/46349518
a = df.values
diffs = a[:,1:26] - a[:,26:27]
rmse_out = np.sqrt(np.einsum('ij,ij->i',diffs,diffs)/3.0)
df['rmse_out'] = rmse_out
df['rmse_out'].to_pickle('results_rmse.p')
When I use a = df.values to pull the values out of the df, the routine's memory usage approaches 100GB according to top. When it computes the differences between the columns, diffs = a[:,1:26] - a[:,26:27], it approaches 120GB and then raises a memory error. How can I modify the code to be more memory-efficient, avoid the error, and actually compute my RMSE values?
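For scale, a back-of-the-envelope check using the row count from the comment in the code above: each float64 copy of the full array is on the order of 21 GB, so several simultaneous copies (the HDF5 read, the DataFrame, the .values copy, the diffs temporary) plausibly account for the reported numbers. A quick sketch:

```python
rows, width = 56410949, 50          # actual dataset shape from the question
gb = 1024 ** 3
full_copy = rows * width * 8        # df.values materializes one float64 copy
diffs_tmp = rows * 25 * 8           # the subtraction allocates another big array
print(round(full_copy / gb, 1), round(diffs_tmp / gb, 1))  # → 21.0 10.5
```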
Answer 0 (score: 1)
The solution I used was to cut the DataFrame down to size:
df = pd.read_hdf('madre_merge_sort32.h5')
for i,d in enumerate(np.array_split(df, 10)):
    d.to_pickle(str(i)+".p")
Then I looped over those pickled mini-dfs and computed the RMSE in each:
import glob

for fn in glob.glob("*.p"):
    # load and clean df values
    df = pd.read_pickle(fn)
    df.replace([np.inf, -np.inf], np.nan, inplace=True)
    df.dropna(inplace=True)
    a = df[df.columns[2:]].to_numpy()  # first two cols are non-numeric, so skip
    # calculate rmse
    diffs = a[:,:25] - a[:,25:]
    rmse_out = np.sqrt(np.einsum('ij,ij->i',diffs,diffs)/3.0)
    df['rmse_out'] = rmse_out
    df.to_pickle("out"+fn)
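One reason the einsum line is worth keeping here: it computes the row-wise sum of squares without materializing a squared copy of diffs, which matters at this scale. A quick equivalence check:

```python
import numpy as np

diffs = np.random.uniform(-5, 5, (1000, 25))
via_einsum = np.einsum('ij,ij->i', diffs, diffs)  # row-wise sum of squares, no temporary
via_square = (diffs ** 2).sum(axis=1)             # same result, but allocates diffs**2
assert np.allclose(via_einsum, via_square)
```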
Then I concatenated them:
dfls = []
for fn in glob.glob("out*.p"):
    df = pd.read_pickle(fn)
    dfls.append(df)
dfcat = pd.concat(dfls)
Chunking seems to have worked for me.
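For reference, the split / per-chunk RMSE / concatenate steps above can also be collapsed into one in-memory loop over .iloc row slices, skipping the intermediate pickle files. A sketch, assuming all columns are numeric and that one chunk plus its temporaries fits in RAM (rmse_rows keeps the /3.0 divisor from the code above):

```python
import numpy as np
import pandas as pd

def rmse_rows(a):
    # per-row RMSE between the first 25 and last 25 columns
    # (divisor 3.0 kept to match the code above)
    diffs = a[:, :25] - a[:, 25:]
    return np.sqrt(np.einsum('ij,ij->i', diffs, diffs) / 3.0)

def add_rmse_chunked(df, chunk_rows=1_000_000):
    # slice row blocks with .iloc so only one block's temporaries
    # are alive at a time, then stitch the results back together
    parts = []
    for start in range(0, len(df), chunk_rows):
        d = df.iloc[start:start + chunk_rows].copy()
        d['rmse_out'] = rmse_rows(d.to_numpy())
        parts.append(d)
    return pd.concat(parts)
```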