Question

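I have two large (~100 GB) text files that I need to iterate over simultaneously.

zip works well for smaller files, but I found that it actually builds a list of lines from my two files, which means every line is held in memory. I only need to process each line once.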

handle1 = open('filea', 'r')
handle2 = open('fileb', 'r')

for i, j in zip(handle1, handle2):
    # do something with i and j
    # write to an output file
    # no need to do anything with i and j after this
    pass

Is there an alternative to zip() that works as a generator and lets me iterate over both files without using more than 200 GB of RAM?

Answer 1

In Python 2, itertools has a function, izip, that does exactly this:

from itertools import izip
for i, j in izip(handle1, handle2):
    ...

If the files have different lengths, you can use izip_longest, because izip stops at the end of the shorter file.
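For readers on Python 3, where izip no longer exists, the built-in zip() is already lazy and returns an iterator. The following is a minimal sketch of that, using in-memory io.StringIO objects as stand-ins for the two large files; the names file_a, file_b, and paired_lines are illustrative and not part of the original answer.

```python
import io

def paired_lines(file_a, file_b):
    """Yield (line_a, line_b) pairs lazily, stopping at the shorter input."""
    # In Python 3, zip() returns an iterator, so only the current pair
    # of lines is held in memory at any time.
    for line_a, line_b in zip(file_a, file_b):
        yield line_a.rstrip("\n"), line_b.rstrip("\n")

# Stand-ins for the real files; objects returned by open() iterate the same way.
file_a = io.StringIO("a1\na2\na3\n")
file_b = io.StringIO("b1\nb2\n")

pairs = list(paired_lines(file_a, file_b))
print(pairs)
```

Collecting the pairs into a list is only for demonstration; with real 100 GB files you would process each pair inside the loop instead.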

Answer 2

You can use izip_longest like this to pad the shorter file with empty lines.

In Python 2.6:

from itertools import izip_longest
with open('filea', 'r') as handle1:
    with open('fileb', 'r') as handle2:
        for i, j in izip_longest(handle1, handle2, fillvalue=""):
            ...

Or in Python 3+:

from itertools import zip_longest
with open('filea', 'r') as handle1, open('fileb', 'r') as handle2:
    for i, j in zip_longest(handle1, handle2, fillvalue=""):
        ...
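To make the padding behavior concrete, here is a small Python 3 sketch that runs zip_longest over two in-memory stand-ins of different lengths. The names short_file, long_file, and merged, and the "|" separator, are assumptions made for illustration only.

```python
import io
from itertools import zip_longest

# In-memory stand-ins for two files of different lengths.
short_file = io.StringIO("x1\n")
long_file = io.StringIO("y1\ny2\ny3\n")

merged = []
for i, j in zip_longest(short_file, long_file, fillvalue=""):
    # Once short_file is exhausted, i is the fillvalue "".
    merged.append(i.rstrip("\n") + "|" + j.rstrip("\n"))

print(merged)
```

Using fillvalue="" keeps both loop variables as strings, so string methods such as rstrip() can be called without checking for None first.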

Answer 3

If you want to truncate to the shortest file:

handle1 = open('filea', 'r')
handle2 = open('fileb', 'r')

try:
    while True:
        # next() raises StopIteration as soon as either file runs out
        i = next(handle1)
        j = next(handle2)

        # do something with i and j
        # write to an output file

except StopIteration:
    pass

finally:
    handle1.close()
    handle2.close()

Otherwise:

handle1 = open('filea', 'r')
handle2 = open('fileb', 'r')

i_ended = False
j_ended = False
while True:
    try:
        i = next(handle1)
    except StopIteration:
        i, i_ended = None, True  # no more lines in filea
    try:
        j = next(handle2)
    except StopIteration:
        j, j_ended = None, True  # no more lines in fileb

    if i_ended and j_ended:
        break

    # do something with i and j (either may be None)
    # write to an output file

handle1.close()
handle2.close()

Or:

handle1 = open('filea', 'r')
handle2 = open('fileb', 'r')

while True:
    # readline() returns '' once a file is exhausted
    i = handle1.readline()
    j = handle2.readline()

    if not i and not j:
        break

    # do something with i and j
    # write to an output file

handle1.close()
handle2.close()
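As a possible variation on the loops above, the try/except blocks can be avoided by passing a default value to the built-in next(). This is only a sketch under that assumption, written for Python 3 with io.StringIO stand-ins; the generator name read_both is hypothetical.

```python
import io

def read_both(handle1, handle2):
    """Yield (i, j) pairs until both inputs are exhausted.

    A finished input contributes None for the rest of the iteration.
    """
    while True:
        # next() with a default returns None instead of raising StopIteration.
        i = next(handle1, None)
        j = next(handle2, None)
        if i is None and j is None:
            break
        yield i, j

# Stand-ins for open('filea') and open('fileb').
handle1 = io.StringIO("a\nb\n")
handle2 = io.StringIO("1\n")

rows = list(read_both(handle1, handle2))
print(rows)
```

None is used as the sentinel rather than an empty string so that a genuinely blank line in a file is not mistaken for the end of that file.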

Answer 4

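Something like this? Wordy, but it seems to be what you're asking for.

It can be adapted to perform a proper merge that matches keys between the two files, which is often needed more than a simple zip function. Also, this does not truncate, which is how the SQL OUTER JOIN algorithm behaves; again, that differs from zip and is more typical of files.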

# parallel() is defined below
with open("file1", "r") as file1:
    with open("file2", "r") as file2:
        for line1, line2 in parallel(file1, file2):
            pass  # process lines

def parallel(file1, file2):
    if1_more, if2_more = True, True
    while if1_more or if2_more:
        line1, line2 = None, None  # Assume simplistic zip-style matching
        # If you're going to compare keys, then you'd do that before
        # deciding what to read.
        if if1_more:
            try:
                line1 = next(file1)
            except StopIteration:
                if1_more = False
        if if2_more:
            try:
                line2 = next(file2)
            except StopIteration:
                if2_more = False
        if if1_more or if2_more:  # skip the empty pair after both files end
            yield line1, line2
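The key-matching merge mentioned above could look roughly like the following sketch. It rests on assumptions the answer does not state: both inputs are sorted by key, each key appears at most once per file, there are no blank lines, and the key is the first whitespace-separated field. The names key_of and outer_join are hypothetical.

```python
import io

def key_of(line):
    # Hypothetical key rule: the first whitespace-separated field.
    return line.split(None, 1)[0]

def outer_join(file1, file2):
    """Yield (line1, line2) pairs matched on key; a missing side is None.

    Assumes both inputs are sorted by key and keys are unique per file.
    """
    line1 = next(file1, None)
    line2 = next(file2, None)
    while line1 is not None or line2 is not None:
        if line2 is None or (line1 is not None and key_of(line1) < key_of(line2)):
            # Key exists only in file1: emit it alone and advance file1.
            yield line1, None
            line1 = next(file1, None)
        elif line1 is None or key_of(line2) < key_of(line1):
            # Key exists only in file2: emit it alone and advance file2.
            yield None, line2
            line2 = next(file2, None)
        else:
            # Keys match: emit both lines and advance both files.
            yield line1, line2
            line1 = next(file1, None)
            line2 = next(file2, None)

# In-memory stand-ins for two key-sorted files.
left = io.StringIO("a 1\nb 2\nd 4\n")
right = io.StringIO("b x\nc y\n")

joined = list(outer_join(left, right))
print(joined)
```

Like parallel(), this reads at most one line ahead from each file, so memory use stays constant regardless of file size.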

zip() alternative for iterating over two iterables
