Question

I am trying to read text data from large files (~10GB) and put each string into a list.

import itertools

# files, pool, addline and function are defined elsewhere in my code
corpus = []
for file in files:
    fc = []

    with open(file) as source:
        # Use multiprocessing to read all lines and add them to the list
        filewords = pool.map(addline, source)

        # Concatenate each sublist in filewords into one list with all string words
        filewords = list(itertools.chain(*filewords))

    corpus.append(filewords)

# Do something with the list
function(corpus)

What should I do to make this more memory-efficient? Could generators help? (I have no experience with them.)

Answer 1

In this case I wouldn't necessarily use multiprocessing at all. 10GB isn't that much, and you can easily do something as simple as this:

for file in files:
    with open(file) as source:
        for line in source:
            pass  # process the line here
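To make the "process line by line" idea concrete, here is a minimal sketch that streams words out of the files instead of collecting them into one big list. The names iter_words and count_words are hypothetical, and splitting each line on whitespace is only an assumption about what the asker's addline does.

```python
def iter_words(paths):
    """Lazily yield every whitespace-separated word from each file in paths."""
    for path in paths:
        with open(path) as source:
            # A file object yields one line at a time, so only the
            # current line is held in memory.
            for line in source:
                # Assumption: a "word" is any whitespace-separated token.
                yield from line.split()


def count_words(words):
    """Consume an iterable of words and return how many there were."""
    return sum(1 for _ in words)
```

Something like count_words(iter_words(files)) then walks the whole corpus while keeping only one line in memory at a time. This only helps if the downstream function can consume an iterator rather than requiring a full list.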

If you are going to use a cluster, don't use multiprocessing; use the cluster's API instead.

Answer 2

As Antti Happala suggested, see whether mmap works for you.
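For reference, a minimal sketch of reading a file through mmap might look like the following. The function name iter_mmap_lines is hypothetical, and whether mmap actually helps depends on your access pattern; for a single sequential pass it may not beat plain file iteration.

```python
import mmap


def iter_mmap_lines(path):
    """Yield each line of the file at path as bytes, via a read-only memory map."""
    with open(path, "rb") as source:
        # Length 0 maps the whole file; mapping an empty file raises ValueError.
        with mmap.mmap(source.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # readline() returns b"" once the end of the map is reached.
            for line in iter(mapped.readline, b""):
                yield line
```

Note that the lines come back as bytes, so they would need decoding (for example line.decode("utf-8"), assuming that encoding) before being treated as text.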

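If not, you may be able to use a generator, but it really depends on what you are doing with the ~10 GB text file. If you go down the generator route, I would suggest creating a class and overriding the __iter__ method. That way, if you have to iterate over the file more than once, you always get a generator that starts from the beginning of the file.

This matters if you are passing the generator between functions.

A generator produced by a function returns a reference to itself when iterated, so a second pass finds it already exhausted.
Overriding __iter__ returns a new generator on every iteration.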

Function generator:

def iterfile(my_file):
    with open(my_file) as the_file:
        for line in the_file:
            yield line

__iter__ generator:

class IterFile(object):

    def __init__(self, my_file):
        self.my_file = my_file

    def __iter__(self):
        with open(self.my_file) as the_file:
            for line in the_file:
                yield line

Difference in behavior:

>>> func_gen = iterfile('/tmp/junk.txt')
>>> iter(func_gen) is iter(func_gen)
True

>>> iter_gen = IterFile('/tmp/junk.txt')
>>> iter(iter_gen) is iter(iter_gen)
False

>>> list(func_gen)
['the only line in the file\n']
>>> list(func_gen)
[]

>>> list(iter_gen)
['the only line in the file\n']
>>> list(iter_gen)
['the only line in the file\n']
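To show why this matters when the iterable is passed between functions, here is a small sketch of a two-pass computation. IterFile is repeated so the block is self-contained; count_lines, longest_line and summarize are hypothetical helpers, not part of any library.

```python
class IterFile:
    """Re-iterable view of a text file: each iter() call reopens the file."""

    def __init__(self, my_file):
        self.my_file = my_file

    def __iter__(self):
        with open(self.my_file) as the_file:
            for line in the_file:
                yield line


def count_lines(lines):
    """First pass: count the lines."""
    return sum(1 for _ in lines)


def longest_line(lines):
    """Second pass: length of the longest line, or 0 if there are none."""
    return max((len(line) for line in lines), default=0)


def summarize(lines):
    # Two independent passes; the second only sees data if `lines`
    # can be iterated again from the start.
    return count_lines(lines), longest_line(lines)
```

Passing an IterFile to summarize gives both passes the full file. Passing a plain function generator would leave the second pass with nothing to read, because the first pass has already exhausted it.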

Processing large lists more efficiently (memory-wise)
