Question

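The code is very simple. It shouldn't leak anything, because everything is done inside the function and nothing is returned. I have a function that goes over all the lines in a file (~20 MiB) and puts them all into a list. The function in question: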

def read_art_file(filename, path_to_dir):
    import codecs
    corpus = []
    corpus_file = codecs.open(path_to_dir + filename, 'r', 'iso-8859-15')
    newline = corpus_file.readline().strip()
    while newline != '':
        # we put into @article a @newline of file and some other info
        # (i left those lists blank for readability)
        article = [newline, [], [], [], [], [], [], [], [], [], [], [], []]
        corpus.append(article)
        del newline
        del article
        newline = corpus_file.readline().strip()
    memory_usage('inside function')
    for article in corpus:
        for word in article:
            del word
        del article
    del corpus
    corpus_file.close()
    memory_usage('inside: after corp deleted')
    return

Here is the main code:

import time

memory_usage('START')
path_to_dir = '/home/soshial/internship/training_data/parser_output/'
read_art_file('accounting.n.txt.wpr.art', path_to_dir)
memory_usage('outside func')
time.sleep(5)
memory_usage('END')

All the memory_usage calls just print the number of KiB allocated by the script.

Executing the script

If I run the script, it gives me:


START memory: 6088 KiB
inside function memory: 393752 KiB (20 MiB file + lists occupy 400 MiB)
inside: after corp deleted memory: 43360 KiB
outside func memory: 34300 KiB (34300-6088 = 28 MiB leaked)
END memory: 34300 KiB

Executing without the list

If I do exactly the same thing, but with the line that appends article to corpus commented out:

article = [newline, [], [], [], [], [], ...]  # we still assign data to `article`
# corpus.append(article)  # we don't have this string during second execution

then the output is:


START memory: 6076 KiB
inside function memory: 6076 KiB
inside: after corp deleted memory: 6076 KiB
outside func memory: 6076 KiB
END memory: 6076 KiB

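Question:

So this way all the memory is freed. I need all of the memory to be freed, because I am going to process hundreds of such files. Am I doing something wrong, or is it a CPython interpreter bug?

UPD. This is how I check memory consumption (taken from some other Stack Overflow question):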

def memory_usage(text = ''):
    """Memory usage of the current process in kilobytes."""
    status = None
    result = {'peak': 0, 'rss': 0}
    try:
        # This will only work on systems with a /proc file system
        # (like Linux).
        status = open('/proc/self/status')
        for line in status:
            parts = line.split()
            key = parts[0][2:-1].lower()
            if key in result:
                result[key] = int(parts[1])
    finally:
        if status is not None:
            status.close()
    print('>', text, 'memory:', result['rss'], 'KiB ')
    return

Answer 1

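Note that Python makes no guarantee that any memory your code uses will actually be returned to the OS. All that garbage collection guarantees is that the memory used by a collected object is free to be used by another object at some point in the future.

From what I've read¹ about the CPython implementation of the memory allocator, memory gets allocated in "pools" for efficiency. When a pool is full, Python allocates a new pool. If a pool contains only dead objects, CPython actually frees the memory associated with that pool, but otherwise it doesn't. This can leave multiple partially full pools hanging around after a function or whatever. However, this doesn't really mean that it is a "memory leak". (CPython still knows about the memory and could potentially free it at some later time.)

¹ I'm not a Python dev, so these details are likely to be incorrect or at least incomplete.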
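To see the difference between "memory Python considers free" and "memory returned to the OS", one option is to compare the RSS numbers from the question with what the standard-library tracemalloc module reports. The sketch below is only an illustration: build_corpus and traced_sizes are hypothetical helper names, and the corpus is synthetic rather than read from the asker's file. tracemalloc counts only blocks allocated through Python's allocators, so the figure after the del can drop close to zero even when the process RSS stays higher.

```python
import gc
import tracemalloc


def build_corpus(n_lines):
    """Build a list shaped like the question's corpus:
    one string plus 12 empty lists per entry (synthetic data)."""
    return [[f"line {i}"] + [[] for _ in range(12)] for i in range(n_lines)]


def traced_sizes(n_lines=20_000):
    """Return (bytes alive while the corpus exists, bytes alive after deleting it),
    as seen by tracemalloc, i.e. at the Python allocator level, not RSS."""
    tracemalloc.start()
    corpus = build_corpus(n_lines)
    alive_with_corpus, _peak = tracemalloc.get_traced_memory()
    del corpus
    gc.collect()
    alive_after_delete, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return alive_with_corpus, alive_after_delete


if __name__ == "__main__":
    with_corpus, after_delete = traced_sizes()
    print("alive with corpus:", with_corpus, "bytes")
    print("alive after del:  ", after_delete, "bytes")
```

If the second number is small while memory_usage still reports tens of MiB, the objects really were freed and the remainder is allocator overhead of the kind described above, not objects that are still reachable.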

Answer 2

This loop

for article in corpus:
    for word in article:
        del word
    del article

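does not free any memory. del word simply decrements the reference count of the object referred to by the name word. However, the loop increments the reference count of each object by one when it sets the loop variable. In other words, there is no net change in any object's reference count because of this loop.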
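A quick way to check this claim is to read the reference count of one article before and after running the same loop. This is a minimal sketch using sys.getrefcount; the function name refcount_before_and_after_loop and the tiny one-element corpus are made up for the illustration. The absolute numbers depend on the interpreter version, so only their equality matters here.

```python
import sys


def refcount_before_and_after_loop():
    """Run the question's cleanup loop and report the refcount of one
    article before and after, plus how many articles the list still holds."""
    article = ["some line", [], []]
    corpus = [article]
    del article  # from here on only `corpus` refers to the article

    before = sys.getrefcount(corpus[0])
    for article in corpus:
        for word in article:
            del word      # unbinds the name `word`, nothing else
        del article       # unbinds the name `article`, the list keeps its reference
    after = sys.getrefcount(corpus[0])
    return before, after, len(corpus)


if __name__ == "__main__":
    print(refcount_before_and_after_loop())
```

The list still contains every article after the loop, so it is the later del corpus that actually releases them.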

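When you comment out the call to corpus.append, you aren't keeping any references to the objects read from the file from one iteration to the next, so the interpreter is free to deallocate the memory earlier, which would account for the decrease in memory that you observe.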
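The effect of keeping or dropping the append can be observed with weak references. The sketch below assumes CPython's reference counting (other implementations may free objects later); Article and alive_after_loop are hypothetical names, and Article is a list subclass only because plain lists cannot be weakly referenced.

```python
import weakref


class Article(list):
    """list subclass, used only so that weak references can point at it."""


def alive_after_loop(keep, n=5):
    """Create n articles in a loop and return how many are still alive afterwards.
    With keep=True each article is appended to a list, as in the question."""
    corpus = []
    refs = []
    for i in range(n):
        article = Article([f"line {i}", [], []])
        refs.append(weakref.ref(article))  # a weak reference does not keep the object alive
        if keep:
            corpus.append(article)
    del article  # drop the last name binding left over from the loop
    return sum(1 for ref in refs if ref() is not None)


if __name__ == "__main__":
    print("with append:   ", alive_after_loop(keep=True))
    print("without append:", alive_after_loop(keep=False))
```

Without the append, each article becomes unreachable as soon as the name article is rebound on the next iteration, so at most one is alive at any time and the allocator can keep reusing the same memory.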
