I have two Python dictionaries of the form {word: np.array(float)}. In the first dictionary I use 300-dimensional numpy vectors; in the second (the keys are identical) they are 150-dimensional. The first file is 4.3 GB on disk, the second 2.2 GB. When I check the loaded objects with sys.getsizeof(), I get:
import sys
import pickle
import numpy as np
For the big dictionary:
with open("big.pickle", 'rb') as f:
    source = pickle.load(f)
sys.getsizeof(source)
# 201326688
all(val.size == 300 for key, val in source.items())
# True
The Linux top command shows 6.22 GB:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4669 hcl 20 0 6933232 6,224g 15620 S 0,0 19,9 0:11.74 python3
For the small dictionary:
with open("small.pickle", 'rb') as f:
    source = pickle.load(f)
sys.getsizeof(source)
# 201326688 # Strange!
all(val.size == 150 for key, val in source.items())
# True
But when I look at the python3 process with the Linux top command, I see 6.17 GB:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4515 hcl 20 0 6875596 6,170g 16296 S 0,0 19,7 0:08.77 python3
Both dictionaries were saved in Python 3 with pickle.HIGHEST_PROTOCOL. I don't want to use json, since encoding and loading would likely be slower. It is also important for me to keep numpy arrays, because I compute np.dot over these vectors.

How can I shrink the RAM used by the dictionary that holds the smaller vectors?
A more precise memory measurement:
# big:
sum(val.nbytes for key, val in source.items())
# 4456416000
# small:
sum(val.nbytes for key, val in source.items())
# 2228208000
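For comparison, the two measurements above can be combined into one rough deep-size estimate (a sketch; `deep_dict_size` is a hypothetical helper, and it ignores the per-array ndarray object headers):

```python
import sys
import numpy as np

def deep_dict_size(d):
    """Approximate RAM for a {key: np.ndarray} dict:
    dict container + key objects + raw array buffers."""
    total = sys.getsizeof(d)
    for key, val in d.items():
        total += sys.getsizeof(key) + val.nbytes
    return total

big = {str(i): np.zeros(300) for i in range(100)}
small = {str(i): np.zeros(150) for i in range(100)}
# getsizeof sees two identical containers; the deep estimate does not
print(sys.getsizeof(big) == sys.getsizeof(small))   # True
print(deep_dict_size(big) > deep_dict_size(small))  # True
```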
EDIT: Thanks to @etene's hint, I managed to save and load my model with hdf5:

Saving:
import pickle
import numpy as np
import h5py

with open("reduced_150_normalized.pickle", 'rb') as f:
    source = pickle.load(f)

# lists to preserve key order
keys = []
values = []
for k, v in source.items():
    keys.append(k)
    values.append(v)
values = np.array(values)
print(values.shape)

with open('model150_keys.pickle', "wb") as f:
    pickle.dump(keys, f, protocol=pickle.HIGHEST_PROTOCOL)  # do not store strings in h5! Everything will hang

h5f = h5py.File('model150_values.h5', 'w')
h5f.create_dataset('model_values', data=values)
h5f.close()
This produces a key-phrase list of length 3713680 and a vector array of shape (3713680, 150).
Loading:
import pickle
import numpy as np
import h5py

with open('model150_keys.pickle', "rb") as f:
    keys = pickle.load(f)  # do not store strings in h5! Everything will hang

# we will reconstruct the model by reading the h5 file row by row
h5f = h5py.File('model150_values.h5', 'r')
d = h5f['model_values']
print(len(keys))
print(d.shape)
model = {}
for i, key in enumerate(keys):
    model[key] = np.array(d[i, :])
h5f.close()
Now I really do consume only 3 GB of RAM:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5012 hcl 20 0 3564672 2,974g 17800 S 0,0 9,5 4:25.27 python3
@etene, you can post your comment as an answer and I will accept it.

The only problem is that loading now takes quite a long time (5 minutes), probably because hdf5 performs a lookup for every row of the numpy array. It would be great if I could somehow iterate over the hdf5 dataset without loading it all into RAM.
EDIT2: Following @hpaulj's advice, I now load the file in chunks, and with 10k-row chunks loading is as fast as pickle, or even faster (4 s):
import pickle
import numpy as np
import h5py

with open('model150_keys.pickle', "rb") as f:
    keys = pickle.load(f)  # do not store strings in h5! Everything will hang

# we will construct the model by reading the h5 file chunk by chunk
h5f = h5py.File('model150_values.h5', 'r')
d = h5f['model_values']
print(len(keys))
print(d.shape)
model = {}
# load in chunks of 10000 rows to speed up loading
for i, key in enumerate(keys):
    if i % 10000 == 0:
        data = d[i:i+10000, :]  # one disk read per chunk
    model[key] = data[i % 10000, :]
h5f.close()
print(len(model))
Thank you all!!!
Answer 0 (score: 1)
To sum up what we found out in the comments:

sys.getsizeof() returning the same value for both dicts is normal behaviour. From the docs: "Only the memory consumption directly attributed to the object is accounted for, not the memory consumption of objects it refers to."

TL;DR for future readers: if your dataset is very large, don't deserialize it all at once; that can require an unpredictable amount of RAM. Use a dedicated format such as HDF5 and read the data in reasonably sized batches, keeping in mind that smaller reads mean more disk I/O and larger reads mean more memory usage.
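The shallow accounting described above is easy to reproduce on a toy example (a minimal sketch, independent of the original data):

```python
import sys
import numpy as np

d300 = {i: np.zeros(300) for i in range(1000)}
d150 = {i: np.zeros(150) for i in range(1000)}

# the dict itself only stores references, so its size is identical
print(sys.getsizeof(d300) == sys.getsizeof(d150))  # True

# but the referenced array buffers differ by exactly 2x
print(sum(v.nbytes for v in d300.values()) //
      sum(v.nbytes for v in d150.values()))        # 2
```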