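I am trying to run some analysis on a large dictionary that is built by reading a file from disk. The read results in a stable memory footprint. I then have a method that performs some calculations on data I copy from that dictionary into a temporary dictionary. I do this so that all of the copying and data use stays inside the method and, I had hoped, disappears when the method call ends.
Sadly, I am doing something wrong. customerdict is defined as follows (at the top of the .py file): customerdict = collections.defaultdict(dict)
The object has the format {customerid: {id: 0 or 1}}
There is also a similarly defined dictionary named allids.
I have a method that computes the sim_pearson distance (modified code from the book Programming Collective Intelligence), shown below.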
from math import sqrt

def sim_pearson(custID1, custID2):
    si = []
    smallcustdict = {}
    smallcustdict[custID1] = customerdict[custID1]
    smallcustdict[custID2] = customerdict[custID2]
    #a loop to round out the remaining allids object to fill in 0 values
    for customerID, catalog in smallcustdict.iteritems():
        for id in allids:
            if id not in catalog:
                smallcustdict[customerID][id] = 0.0
    #get the list of mutually rated items
    for id in smallcustdict[custID1]:
        if id in smallcustdict[custID2]:
            si.append(id) # = 1
    #return 0 if there are no matches
    if len(si) == 0: return 0
    #add up all the preferences
    sum1 = sum([smallcustdict[custID1][id] for id in si])
    sum2 = sum([smallcustdict[custID2][id] for id in si])
    #sum up the squares
    sum1sq = sum([pow(smallcustdict[custID1][id],2) for id in si])
    sum2sq = sum([pow(smallcustdict[custID2][id],2) for id in si])
    #sum up the products
    psum = sum([smallcustdict[custID1][id] * smallcustdict[custID2][id] for id in si])
    #calc Pearson score
    num = psum - (sum1*sum2/len(si))
    den = sqrt((sum1sq - pow(sum1,2)/len(si)) * (sum2sq - pow(sum2,2)/len(si)))
    del smallcustdict
    del si
    del sum1
    del sum2
    del sum1sq
    del sum2sq
    del psum
    if den == 0:
        return 0
    return num/den
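Each pass through the sim_pearson method grows the memory footprint of python.exe without bound. I tried explicitly deleting the locally scoped variables with del.
Watching Task Manager, memory jumps in 6-10 MB increments. After the initial customerdict is set up, the footprint is 137 MB.
Any ideas why I am running out of memory doing it this way?
Answer 0 (score: 3)
I think the problem is here: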
smallcustdict[custID1] = customerdict[custID1]
smallcustdict[custID2] = customerdict[custID2]
#a loop to round out the remaining allids object to fill in 0 values
for customerID, catalog in smallcustdict.iteritems():
    for id in allids:
        if id not in catalog:
            smallcustdict[customerID][id] = 0.0
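The dictionaries from customerdict are referenced in smallcustdict, not copied, so when you add entries to them, those entries persist in customerdict. This is the only place I can see where you do anything that outlives the function's scope, so I imagine this is the problem.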
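To make the aliasing concrete, here is a minimal sketch with made-up data. The names fill_zeros, "c1" and the sample ids are hypothetical stand-ins, and it uses .items() so it runs on Python 3 as well as Python 2. It shows that the zero-filling loop writes straight into the dict owned by customerdict:

```python
import collections

# Hypothetical stand-ins for the question's module-level objects.
customerdict = collections.defaultdict(dict)
customerdict["c1"] = {"a": 1}
allids = ["a", "b", "c", "d"]

def fill_zeros(cust_id):
    # Same pattern as in sim_pearson: the inner dict is not copied.
    smallcustdict = {cust_id: customerdict[cust_id]}
    for customer_id, catalog in smallcustdict.items():
        for item_id in allids:
            if item_id not in catalog:
                catalog[item_id] = 0.0
    # Report whether both names point at the very same dict object.
    return smallcustdict[cust_id] is customerdict[cust_id]

before = len(customerdict["c1"])
shared = fill_zeros("c1")
after = len(customerdict["c1"])
print(before, after, shared)
```

After the call the customer's dict holds one entry per id in allids, even though smallcustdict itself is gone. Repeated over many customers, that is the kind of growth described in the question, and del on the local names cannot undo it.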
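Note that you are making a lot of work for yourself in many places by not using list comprehensions and by repeating the same thing instead of doing things in a generic way. A better version might look like this: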
import collections
import functools
import operator
from math import sqrt

customerdict = collections.defaultdict(dict)

def sim_pearson(custID1, custID2):
    #Declaring as a dict literal is nicer.
    smallcustdict = {
        custID1: customerdict[custID1],
        custID2: customerdict[custID2],
    }
    # Essentially unchanged, as I'm not sure what the intent is here.
    # Note that this still writes into the dicts held by customerdict.
    for customerID, catalog in smallcustdict.iteritems():
        for id in allids:
            if id not in catalog:
                smallcustdict[customerID][id] = 0.0
    #dict views are set-like, so the easier way to do what you want is the intersection of the two.
    si = smallcustdict[custID1].viewkeys() & smallcustdict[custID2].viewkeys()
    #if not is a cleaner way of checking for no values.
    if not si:
        return 0
    #Made more generic to avoid repetition and wastefully looping repeatedly.
    parts = [list(part) for part in zip(*((value[id] for value in smallcustdict.values()) for id in si))]
    sums = [sum(part) for part in parts]
    sumsqs = [sum(pow(i, 2) for i in part) for part in parts]
    psum = sum(functools.reduce(operator.mul, part) for part in zip(*parts))
    sum1, sum2 = sums
    sum1sq, sum2sq = sumsqs
    #Unchanged.
    num = psum - (sum1*sum2/len(si))
    den = sqrt((sum1sq - pow(sum1,2)/len(si)) * (sum2sq - pow(sum2,2)/len(si)))
    #Again using if not.
    if not den:
        return 0
    else:
        return num/den
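Note that this is completely untested, as the code you provided is not a complete example. However, it should be easy to use as a basis for improvement.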
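One way to avoid writing into customerdict at all is to skip the zero-filling step and read missing ids as 0.0 with dict.get. The sketch below is an illustration, not the asker's method: sim_pearson_readonly and the sample data are hypothetical names, and it assumes every id a customer has also appears in allids, so that the comparison runs over exactly the ids in allids.

```python
import collections
from math import sqrt

def sim_pearson_readonly(prefs1, prefs2, all_ids):
    """Pearson correlation over all_ids, reading missing ids as 0.0."""
    ids = list(all_ids)
    n = float(len(ids))
    if n == 0:
        return 0
    # .get never inserts, so neither input dict is modified.
    xs = [prefs1.get(i, 0.0) for i in ids]
    ys = [prefs2.get(i, 0.0) for i in ids]
    sum1, sum2 = sum(xs), sum(ys)
    sum1sq = sum(x * x for x in xs)
    sum2sq = sum(y * y for y in ys)
    psum = sum(x * y for x, y in zip(xs, ys))
    num = psum - sum1 * sum2 / n
    # Clamp at zero to guard against tiny negative rounding errors.
    var1 = max(sum1sq - sum1 ** 2 / n, 0.0)
    var2 = max(sum2sq - sum2 ** 2 / n, 0.0)
    den = sqrt(var1 * var2)
    if den == 0:
        return 0
    return num / den

# Made-up sample data in the question's {customerid: {id: 0 or 1}} shape.
customerdict = collections.defaultdict(dict)
customerdict["c1"] = {"a": 1, "b": 0}
customerdict["c2"] = {"a": 1, "c": 1}
allids = ["a", "b", "c", "d"]

score = sim_pearson_readonly(
    customerdict.get("c1", {}), customerdict.get("c2", {}), allids)
print(score)
```

The lookups go through customerdict.get rather than customerdict[...] because indexing a defaultdict with a missing key inserts a new empty dict, which would itself add entries on every miss.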
Answer 1 (score: 1)
Try changing the following two lines:
smallcustdict[custID1] = customerdict[custID1]
smallcustdict[custID2] = customerdict[custID2]
to
smallcustdict[custID1] = customerdict[custID1].copy()
smallcustdict[custID2] = customerdict[custID2].copy()
That way, the changes you make to those two dictionaries will not persist in customerdict when the sim_pearson() function returns.
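A small sketch of the effect of .copy(), using made-up data; padded_copy and the sample ids are hypothetical names for illustration only:

```python
import collections

# Hypothetical stand-ins for the question's module-level objects.
customerdict = collections.defaultdict(dict)
customerdict["c1"] = {"a": 1}
allids = ["a", "b", "c"]

def padded_copy(cust_id):
    # Work on a copy so the zero-filling stays local to this call.
    catalog = customerdict[cust_id].copy()
    for item_id in allids:
        catalog.setdefault(item_id, 0.0)
    return catalog

padded = padded_copy("c1")
print(len(padded), len(customerdict["c1"]))
```

dict.copy() is a shallow copy, which is sufficient here because the values are plain numbers. The padded copy is temporary and can be reclaimed once nothing references it, so the per-call cost is a short-lived dict with about len(allids) entries rather than permanent growth.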