Question

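I need to create a list containing millions of elements (between 10^6 and 10^7) whose data comes from an external file.

Is it possible to store these elements in a list without freezing the system, or do I need to use some other method?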

Answer 1

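Edit
Based on the discussion, to load a large amount of data from a file into a list, I suggest reading the data in chunks as a generator and then combining those generators with itertools.chain to get one concatenated generator. The final generator can then be iterated over for further operations or processing. This way memory is used efficiently.

Below is a function that reads from a file in chunks and returns a generator of chunks.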

    def read_data_chunks(file_object, chunk_size=1024):
        """Read a file in chunks using a lazy method (generator).

        chunk size: default 1k
        """
        while True:
            data = file_object.read(chunk_size)
            #data = data.strip().rstrip('\n')
            if not data:
                break
            yield data.strip()

Next, we read the data in chunks from the read_data_chunks function and combine the different chunks together.

    from itertools import chain

    f = open('numbers1.txt')
    gen = iter([])  # start off with an empty iterator

    # adjust chunk size as needed, 10k here, change as applicable
    # you can experiment with bigger chunks for a huge file.
    # note: each piece is a string, so chain() yields it character by character
    for piece in read_data_chunks(f, chunk_size=10240):
        gen = chain(gen, piece)

Now you can access the final generator for further processing (for example, iterating over it), just as in the previous answer.

    for i in gen:
        print i
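A fixed-size chunk can end in the middle of a number, so when the file holds one value per line a line-oriented reader may be easier to work with. The following is a minimal Python 3 sketch under that assumption; the name `read_numbers` and the one-integer-per-line layout are hypothetical, since the question does not describe the file format.

```python
import os
import tempfile


def read_numbers(path):
    """Lazily yield one int per non-empty line of the file at `path`.

    Assumes (hypothetically) that the file stores one integer per line.
    """
    with open(path) as handle:
        for line in handle:  # file objects iterate line by line, lazily
            line = line.strip()
            if line:
                yield int(line)


# Small demonstration with a temporary file standing in for the real data.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("\n".join(str(n) for n in range(1000)))
    demo_path = tmp.name

total = sum(read_numbers(demo_path))  # consumes the generator without building a list
os.remove(demo_path)
print(total)
```

If a real list is needed at the end, `list(read_numbers(path))` would materialize it in one step.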

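Previous answer

If you only want a list of 10^6 sequential numbers, you can do the following. The sequence is created with a generator expression, which does not actually produce items until they are accessed (lazy evaluation).

If we try to build it as a list instead, it runs into a memory error (for larger values, depending on whether your OS is 32-bit or 64-bit). For example, on my 64-bit Windows machine it runs into the error at 10 ** 9.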

    # memory efficient as we are not actually creating anything at this time.
    >>> x = (i for i in xrange(10**6))  # ok with gen comprehension
    >>> x = (i for i in xrange(10**8))  # ok with gen comprehension
    >>> y = [i for i in xrange(10**8)]  # runs into error @ 10**8, ok at 10**6, 10**7

    Traceback (most recent call last):
      File "<pyshell#1>", line 1, in <module>
        y = [i for i in xrange(10**8)]
    MemoryError
    >>>
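To make the difference concrete, the container sizes can be compared with `sys.getsizeof`. This is a rough Python 3 sketch (so `range` replaces `xrange`); the variable names are illustrative, and the exact byte counts depend on the interpreter and platform.

```python
import sys

n = 10**6  # kept small so the list fits comfortably in memory

lazy = (i for i in range(n))   # generator expression: nothing computed yet
eager = [i for i in range(n)]  # list comprehension: all n items stored

# getsizeof reports the container only, not the int objects it refers to,
# so the real footprint of the list is larger than the figure printed here.
print("generator:", sys.getsizeof(lazy), "bytes")
print("list:     ", sys.getsizeof(eager), "bytes")
```

The generator stays at a small constant size regardless of `n`, while the list grows with the number of elements.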

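Even with a generator expression, you start hitting limits at 10 ** 10; in the session below it is xrange itself that rejects an argument too large for a C long. At that point you need to switch to a different approach, such as pytables, pandas, or a database.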

    >>> x = (i for i in xrange(10**6))
    >>> x = (i for i in xrange(10**8))
    >>> x = (i for i in xrange(10**9))
    >>> x = (i for i in xrange(10**10))

    Traceback (most recent call last):
      File "<pyshell#4>", line 1, in <module>
        x = (i for i in xrange(10**10))
    OverflowError: Python int too large to convert to C long
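The OverflowError above comes from Python 2's xrange, whose argument must fit in a C long (32 bits on 64-bit Windows). As a hedged aside, Python 3's `range` has no such limit, so the same expression can be built there and consumed lazily. The sketch below assumes Python 3 and uses `itertools.islice` to take only the first few items.

```python
from itertools import islice

# In Python 3, range accepts arbitrarily large integers.
x = (i for i in range(10**10))

first_five = list(islice(x, 5))  # pull five items; the rest is never generated
print(first_five)
```

This only removes the counter limit; actually storing that many items still calls for one of the alternatives mentioned above.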

You can iterate over a generator just as you would over a normal list.

    >>> for i in x:
            print i
            if i >= 5:
                break

    0
    1
    2
    3
    4
    5
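One difference from a list is worth keeping in mind: a generator can be consumed only once. A small Python 3 illustration, with made-up variable names:

```python
squares = (i * i for i in range(5))

first_pass = list(squares)   # consumes every item
second_pass = list(squares)  # the generator is now exhausted

print(first_pass)
print(second_pass)
```

If the data has to be traversed several times, either rebuild the generator for each pass or store the items in a list once.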

Read more about generator expressions here.

How can I efficiently store ~10^6 elements in a list in Python?
