Alternative to zip() for iterating over two iterables

Date: 2010-02-24 03:04:46

Tags: python

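I have two large (~100 GB) text files that I need to iterate through simultaneously.

zip() works well for smaller files, but I found that it actually builds a list of line pairs from my two files, which means every line is held in memory. I don't need to do anything with the lines more than once.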

handle1 = open('filea', 'r')
handle2 = open('fileb', 'r')

for i, j in zip(handle1, handle2):
    # do something with i and j
    # write to an output file
    # no need to do anything with i and j after this
    pass

Is there an alternative to zip() that acts as a generator and lets me iterate through these two files without using more than 200 GB of RAM?

4 answers:

Answer 0 (score: 22)

itertools has a function, izip, that does this:

from itertools import izip
for i, j in izip(handle1, handle2):
    ...

If the files have different numbers of lines, you can use izip_longest, because izip stops at the end of the shorter file.
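For readers on Python 3, where itertools.izip no longer exists, the built-in zip() is already lazy and behaves like izip. The following is a minimal sketch of that behavior; the in-memory io.StringIO objects and their contents are stand-ins invented for illustration, not the asker's files.

```python
import io

# Hypothetical in-memory stand-ins for the two large files.
file_a = io.StringIO("a1\na2\na3\n")
file_b = io.StringIO("b1\nb2\n")

pairs = zip(file_a, file_b)  # Python 3: returns an iterator, no list is built
first = next(pairs)          # pulls exactly one line from each input
rest = list(pairs)           # iteration stops when the shorter input ends

print(first)
print(rest)
```

Only one pair of lines is held at a time while iterating, so memory use does not grow with file size.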

Answer 1 (score: 15)

You can use izip_longest like this to pad the shorter file with empty lines.

In Python 2.6:

from itertools import izip_longest
with open('filea', 'r') as handle1:
    with open('fileb', 'r') as handle2:
        for i, j in izip_longest(handle1, handle2, fillvalue=""):
            ...

In Python 3+:

from itertools import zip_longest
with open('filea', 'r') as handle1, open('fileb', 'r') as handle2:
    for i, j in zip_longest(handle1, handle2, fillvalue=""):
        ...
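To make the padding behavior concrete, here is a small Python 3 sketch using zip_longest on in-memory inputs. The io.StringIO objects and their contents are assumptions made up for the example.

```python
import io
from itertools import zip_longest

# Hypothetical inputs: the second "file" is shorter than the first.
file_a = io.StringIO("a1\na2\na3\n")
file_b = io.StringIO("b1\n")

# Once file_b is exhausted, its slot is filled with fillvalue.
padded = list(zip_longest(file_a, file_b, fillvalue=""))
print(padded)
```

A real blank line is read as "\n", so an empty string in a pair unambiguously marks the file that has already ended.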

Answer 2 (score: 0)

If you want to truncate to the shortest file:

handle1 = open('filea', 'r')
handle2 = open('fileb', 'r')

try:
    while 1:
        # next() raises StopIteration as soon as either file runs out
        i = next(handle1)
        j = next(handle2)

        # do something with i and j
        # write to an output file

except StopIteration:
    pass

finally:
    handle1.close()
    handle2.close()

Otherwise:

handle1 = open('filea', 'r')
handle2 = open('fileb', 'r')

i_ended = False
j_ended = False
while 1:
    try:
        i = next(handle1)
    except StopIteration:
        i, i_ended = '', True  # pad the exhausted file with an empty line
    try:
        j = next(handle2)
    except StopIteration:
        j, j_ended = '', True
    if i_ended and j_ended:
        break

    # do something with i and j
    # write to an output file

handle1.close()
handle2.close()

Or:

handle1 = open('filea', 'r')
handle2 = open('fileb', 'r')

while 1:
    # readline() returns '' at end of file instead of raising
    i = handle1.readline()
    j = handle2.readline()

    if not i and not j:
        break

    # do something with i and j
    # write to an output file

handle1.close()
handle2.close()

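Answer 3 (score: -1)

Something like this? It's wordy, but it seems to be what you're asking for.

It can be adjusted to do a proper merge that matches keys between the two files, which is often what's needed more than the simplistic zip function. Also, this doesn't truncate, which is what the SQL OUTER JOIN algorithm does; again, that is different from zip and more typical of files.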

# parallel() is the generator defined below; define it before running this.
with open("file1", "r") as file1:
    with open("file2", "r") as file2:
        for line1, line2 in parallel(file1, file2):
            pass  # process lines

def parallel(file1, file2):
    if1_more, if2_more = True, True
    while if1_more or if2_more:
        line1, line2 = None, None  # Assume simplistic zip-style matching
        # If you're going to compare keys, then you'd do that before
        # deciding what to read.
        if if1_more:
            try:
                line1 = next(file1)
            except StopIteration:
                if1_more = False
        if if2_more:
            try:
                line2 = next(file2)
            except StopIteration:
                if2_more = False
        # Skip the final pass, where both files are exhausted,
        # so no spurious (None, None) pair is produced.
        if if1_more or if2_more:
            yield line1, line2
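The key-matching merge this answer alludes to could look roughly like the sketch below. It is only an illustration under several assumptions that are not in the original: both inputs are sorted by key, the key is the first whitespace-separated field, keys are unique within each file, and there are no blank lines. The names key_of and outer_join are hypothetical.

```python
import io

def key_of(line):
    # Assumed key: the first whitespace-separated field of the line.
    return line.split(None, 1)[0]

def outer_join(file1, file2):
    """Yield (line1, line2) pairs matched by key; None marks a missing side."""
    it1, it2 = iter(file1), iter(file2)
    line1, line2 = next(it1, None), next(it2, None)
    while line1 is not None or line2 is not None:
        if line2 is None or (line1 is not None and key_of(line1) < key_of(line2)):
            yield line1, None            # key present only in file1
            line1 = next(it1, None)
        elif line1 is None or key_of(line2) < key_of(line1):
            yield None, line2            # key present only in file2
            line2 = next(it2, None)
        else:
            yield line1, line2           # same key on both sides
            line1, line2 = next(it1, None), next(it2, None)

# Hypothetical sorted inputs standing in for the two files.
left = io.StringIO("a 1\nb 2\nd 4\n")
right = io.StringIO("b x\nc y\n")
joined = list(outer_join(left, right))
print(joined)
```

Like the generator above, this reads one line at a time from each file, so memory use stays constant regardless of file size.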