Question

I want to read a file of roughly 500MB into a dict, keyed by a field taken from each line. The code snippet is as follows:

f2 = open("ENST-NM-chr-name.txt", "r")   # small amount
lines = [l.strip() for l in f2.readlines() if l.strip()]
sample = dict([(l.split("\t")[2].strip("\""), l) for l in lines])    ## convert [(1,2), (3,4)] to {1:2, 3:4}

When I run this on a machine with 4GB of memory, Python raises a MemoryError. If I change the expression assigned to sample to [l for l in lines], it works fine.

At first I thought the split method was consuming a lot of memory, so I changed my code to:

def find_nth(haystack, needle, n):
    start = haystack.find(needle)
    while start >= 0 and n > 1:
        start = haystack.find(needle, start+len(needle))
        n -= 1
    return start

...

sample = dict([(l[find_nth(l, "\t", 4):].strip(), l) for l in lines])

But the result was the same.

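A new finding: regardless of the key-extraction logic, the code runs normally without running out of memory (OOM) as long as I leave out the dict() conversion.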

Can anyone give me some ideas about this problem?

Answer 1

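You are creating a list containing every line, which stays alive until lines goes out of scope. You then build a second large list of entirely different objects from it, and then a dict from that, all before any of the earlier data can be freed. Just build the dict in one step.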

with open("ENST-NM-chr-name.txt") as f:
    sample = {}

    for l in f:
        l = l.strip()

        if l:
            sample[l.split("\t")[2].strip('"')] = l

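You can get roughly the same effect by using generator expressions instead of list comprehensions, but the explicit loop feels nicer to me (it does not strip twice).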
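For comparison, here is a minimal sketch of the generator-expression variant mentioned above, arranged so that each line is stripped only once. The helper name build_index and the in-memory sample data are assumptions made for illustration; they are not part of the original answer.

```python
import io


def build_index(handle):
    """Build {key: line} from a tab-separated file handle, one line at a time.

    The key is the third tab-separated field with surrounding double
    quotes removed, mirroring the loop in the answer above.
    """
    # First generator: strip each raw line exactly once.
    stripped = (raw.strip() for raw in handle)
    # Second generator: skip blank lines and yield (key, line) pairs lazily,
    # so no intermediate list of lines or tuples is ever built.
    pairs = (
        (line.split("\t")[2].strip('"'), line)
        for line in stripped
        if line
    )
    return dict(pairs)


# Hypothetical sample data standing in for ENST-NM-chr-name.txt.
demo = io.StringIO('a\tb\t"k1"\tx\n\n c\td\t"k2"\ty\n')
index = build_index(demo)
```

With a real file, the same function could be called as build_index(f) inside a with open(...) as f: block.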

Answer 2

What if you turn the list into a generator, and your dict into a lovely dict comprehension?

f2 = open("ENST-NM-chr-name.txt", "r")   # small amount
lines = (l.strip() for l in f2 if l.strip())
sample = {line.split('\t')[2].strip('\"'): line for line in lines}

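Line 2 above was originally, and mistakenly, written as lines = (l.strip() for l in f2.readlines() if l.strip()). Calling readlines() still loads every line into a list at once, which defeats the purpose of the generator.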
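The difference between the two spellings can be seen in a small sketch. It uses io.StringIO as a stand-in for a real file (an assumption for illustration); an ordinary text file object opened with open() exposes the same readlines() method and line-by-line iteration.

```python
import io

# Hypothetical three-line "file" held in memory.
handle = io.StringIO("a\nb\nc\n")

# readlines() materializes every line up front and returns a list.
all_lines = handle.readlines()

# Iterating the handle itself produces lines on demand instead.
handle.seek(0)
line_iter = iter(handle)
first = next(line_iter)  # only the first line has been produced so far
```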

Might the generator and the dict comprehension (somewhat) ease the memory requirements?
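One way to check this empirically is to compare peak allocations with the standard-library tracemalloc module. The sketch below is only an illustration under stated assumptions: the synthetic rows, the row count, and the function names eager, lazy, and peak_bytes are all made up here, and the exact byte counts will vary with the Python version and platform.

```python
import io
import tracemalloc


def make_data(n_rows):
    # Synthetic tab-separated rows; the column contents are invented.
    return "".join(f'ENST{i}\tNM{i}\t"key{i}"\tchr1\n' for i in range(n_rows))


def eager(handle):
    # Mirrors the question: a full list of lines, then a list of tuples.
    lines = [l.strip() for l in handle.readlines() if l.strip()]
    return dict([(l.split("\t")[2].strip('"'), l) for l in lines])


def lazy(handle):
    # Mirrors Answer 2: generator plus dict comprehension.
    lines = (l.strip() for l in handle if l.strip())
    return {line.split("\t")[2].strip('"'): line for line in lines}


def peak_bytes(builder, text):
    handle = io.StringIO(text)
    tracemalloc.start()
    result = builder(handle)
    _, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
    tracemalloc.stop()
    return peak, result


text = make_data(20000)
eager_peak, eager_result = peak_bytes(eager, text)
lazy_peak, lazy_result = peak_bytes(lazy, text)
print(f"eager peak: {eager_peak} bytes, lazy peak: {lazy_peak} bytes")
```

Both builders return the same dict. The eager version additionally keeps the list of lines and the list of tuples alive while the dict is being built, which is where its extra peak memory comes from.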

Memory error when processing a file in Python
