Question

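This is the Python code I am using. I have a 5 GB file that I need to split into roughly 10-12 files by line number, but this code raises a memory error. Can someone tell me what is wrong with it?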

from itertools import izip_longest

def grouper(n, iterable, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)

n = 386972

with open('reviewsNew.txt','rb') as f:
    for i, g in enumerate(grouper(n, f, fillvalue=''), 1):
        with open('small_file_{0}'.format(i * n), 'w') as fout:
            fout.writelines(g)

Answer 1

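Just use groupby. That way you don't need the list of 386972 iterator references, and no chunk has to be collected into a 386972-element tuple before it is written: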

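from itertools import groupby

n = 386972  # lines per output file
with open('reviewsNew.txt', 'rb') as f:
    # Key each line by its chunk number (line index // n); groupby is lazy,
    # so lines are streamed to disk instead of being gathered first.
    for idx, lines in groupby(enumerate(f), lambda pair: pair[0] // n):
        # idx starts at 0, so the first file is small_file_0
        with open('small_file_{0}'.format(idx * n), 'wb') as fout:
            fout.writelines(l for _, l in lines)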
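
For anyone on Python 3, here is a rough sketch of the same idea using itertools.islice instead of groupby. It is not from the answer above: the function name split_by_lines and the out_prefix parameter are made up for illustration, and the file naming (prefix plus part number times chunk size) just mirrors the question's scheme. It assumes the input is read and written in binary mode, as in the original code.

```python
from itertools import islice


def split_by_lines(src_path, lines_per_file, out_prefix="small_file_"):
    """Split src_path into files of at most lines_per_file lines each.

    Hypothetical helper: returns the list of output paths it wrote.
    """
    written = []
    with open(src_path, "rb") as src:
        part = 0
        while True:
            # islice takes at most lines_per_file lines from the shared
            # file iterator, lazily, without building a list or tuple.
            chunk = islice(src, lines_per_file)
            first = next(chunk, None)
            if first is None:
                break  # input exhausted; do not create an empty file
            part += 1
            out_path = "{0}{1}".format(out_prefix, part * lines_per_file)
            with open(out_path, "wb") as fout:
                fout.write(first)
                fout.writelines(chunk)  # consumes the rest of this chunk
            written.append(out_path)
    return written
```

Usage would be something like split_by_lines('reviewsNew.txt', 386972). Peeking at the first line of each chunk is only there so that an input whose line count is an exact multiple of the chunk size does not leave a trailing empty file.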

Splitting a large file into smaller files causes a memory error
