I have successfully debugged my own memory leak problem. However, I noticed some very strange behavior along the way.
    for fid, fv in freqDic.iteritems():
        outf.write(fid+"\t") #ID
        for i, term in enumerate(domain): #Vector
            tfidf = self.tf(term, fv) * self.idf( term, docFreqDic)
            if i == len(domain) - 1:
                outf.write("%f\n" % tfidf)
            else:
                outf.write("%f\t" % tfidf)
        outf.flush()
        print "Memory increased by", int(self.memory_mon.usage()) - startMemory
    outf.close()

    def tf(self, term, freqVector):
        total = freqVector[TOTAL]
        if total == 0:
            return 0
        if term not in freqVector: ## Without these two lines the memory leak occurs
            return 0               ##
        return float(freqVector[term]) / freqVector[TOTAL]

    def idf(self, term, docFrequencyPerTerm):
        if term not in docFrequencyPerTerm:
            return 0
        return math.log( float(docFrequencyPerTerm[TOTAL])/docFrequencyPerTerm[term])

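Basically, let me describe my problem:
1) I am doing tfidf calculations.
2) I traced the source of the memory leak to defaultdict.
3) I am using memory_mon from "How to get current CPU and RAM usage in Python?"
4) The cause of my memory leak is as follows: a) in self.tf, if the lines "if term not in freqVector: return 0" are missing, the memory leak occurs. (I verified this myself using memory_mon and saw a sharp increase in memory that kept on growing.)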
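The solution to my problem is: 1) because fv is a defaultdict, any lookup of a key that is not found in fv creates an entry for it. Over a very large domain, this causes the memory growth.
I decided to use dict instead of defaultdict, and the memory problem did go away.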
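
The behavior described above can be reproduced in a few lines. This is only a minimal sketch, written for Python 3 (the question's code is Python 2); the names fv, fv2 and domain and their contents are made-up stand-ins, not the asker's data. It shows that indexing a defaultdict with a missing key stores a new entry, while a membership test or .get() does not.

```python
from collections import defaultdict

domain = ["apple", "banana", "cherry"]   # hypothetical vocabulary

fv = defaultdict(int)                    # hypothetical frequency vector
fv["apple"] = 3

# Indexing a missing key calls the default factory and stores the result.
for term in domain:
    _ = fv[term]
size_after_indexing = len(fv)            # "banana" and "cherry" were inserted

fv2 = defaultdict(int)
fv2["apple"] = 3

# A membership test or .get() reads without inserting anything.
for term in domain:
    if term in fv2:
        _ = fv2[term]
    _ = fv2.get(term, 0)
size_after_checks = len(fv2)             # only "apple" is stored

print(size_after_indexing, size_after_checks)
```

This is why the two guard lines in self.tf make a difference: they keep the lookup from ever reaching freqVector[term] for a term that is not already there.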
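The only thing that puzzles me is this: since fv is created in "for fid, fv in freqDic.iteritems():", shouldn't fv be destroyed at the end of every iteration? I tried putting gc.collect() at the end of the for loop, but gc was not able to collect anything (it returns 0). Yes, the hypothesis is right, but if the for loop really destroyed all of its temporary variables, memory should stay fairly constant from one iteration to the next.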
This is the output with the two lines in self.tf:
Memory increased by 12
Memory increased by 948
Memory increased by 28
Memory increased by 36
Memory increased by 36
Memory increased by 32
Memory increased by 28
Memory increased by 32
Memory increased by 32
Memory increased by 32
Memory increased by 40
Memory increased by 32
Memory increased by 32
Memory increased by 28
Without the two lines:
Memory increased by 1652
Memory increased by 3576
Memory increased by 4220
Memory increased by 5760
Memory increased by 7296
Memory increased by 8840
Memory increased by 10456
Memory increased by 12824
Memory increased by 13460
Memory increased by 15000
Memory increased by 17448
Memory increased by 18084
Memory increased by 19628
Memory increased by 22080
Memory increased by 22708
Memory increased by 24248
Memory increased by 26704
Memory increased by 27332
Memory increased by 28864
Memory increased by 30404
Memory increased by 32856
Memory increased by 33552
Memory increased by 35024
Memory increased by 36564
Memory increased by 39016
Memory increased by 39924
Memory increased by 42104
Memory increased by 42724
Memory increased by 44268
Memory increased by 46720
Memory increased by 47352
Memory increased by 48952
Memory increased by 50428
Memory increased by 51964
Memory increased by 53508
Memory increased by 55960
Memory increased by 56584
Memory increased by 58404
Memory increased by 59668
Memory increased by 61208
Memory increased by 62744
Memory increased by 64400
I look forward to your answers.
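Edit: It seems my terminology may be wrong (or at least appear wrong). What I mean is a memory leak in

    for fid, fv in freqDic.iteritems():

I know that fv grows in size because of 1), but it should still be destroyed at the end of the loop! Memory should not keep growing. Isn't this a memory leak?

Answer 0 (score: 2)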
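Iterating over freqDic does not produce new values; it hands out references to the values the dict already holds. That means you are adding the new entries to the fv objects that freqDic keeps holding even after the loop.
Another solution is to clear freqDic after the loop.
In general, Python passes everything by reference, although it sometimes appears otherwise. Strings and integers are immutable, so when they are "changed" the name is simply rebound to a new object.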
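
A small sketch of the point made above, again in Python 3 and with a hypothetical freq_dic standing in for the asker's freqDic: the loop variable is the very object stored in the outer dict, so entries added through it are still there after the loop, and they are only released once the outer dict lets go of them.

```python
from collections import defaultdict

# Hypothetical stand-in for freqDic: document ID -> defaultdict of term counts.
freq_dic = {"doc1": defaultdict(int), "doc2": defaultdict(int)}

for fid, fv in freq_dic.items():     # .iteritems() in Python 2
    # fv is the same object that freq_dic stores, not a copy.
    assert fv is freq_dic[fid]
    for term in ("a", "b", "c"):
        _ = fv[term]                 # inserts a default entry into the stored dict

# The entries created inside the loop are still held by freq_dic.
sizes_after_loop = {fid: len(fv) for fid, fv in freq_dic.items()}

# Dropping the references is what allows the memory to be reclaimed.
freq_dic.clear()
remaining = len(freq_dic)

print(sizes_after_loop, remaining)
```

Since every fv stays reachable through the outer dict, gc.collect() has nothing to collect, which is consistent with the 0 reported in the question.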
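
Answer 1 (score: 1)
I suspect that Python's memory usage may be growing because floats are also objects in Python, and the interpreter maintains an unbounded and immortal free list of floats. So whenever a floating-point calculation produces a new float that has not appeared before, Python allocates a new float object on the free list and then keeps the object around in case it is needed later.
See a similar discussion on the Python bug tracker.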

Answer 2 (score: 0)
This is not a memory leak, because no memory is being leaked; it is being occupied by your defaultdict. For example:

    from collections import defaultdict
    d = defaultdict(int)
    for i in xrange(10**7):
        a = d[i]

Would you consider this a memory leak? You are assigning values to a dict, so memory usage grows as a result. It is equivalent to this:

    d = {}
    for i in xrange(10**7):
        d[i] = 0

That is not a memory leak.