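I have two large (~100 GB) text files that I need to iterate over simultaneously.
zip works well for smaller files, but I found out that it actually builds a list of lines from my two files (in Python 2, zip returns a list). That means every line gets stored in memory. I never need to do anything with a line more than once.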
handle1 = open('filea', 'r'); handle2 = open('fileb', 'r')
for i, j in zip(handle1, handle2):
    # do something with i and j.
    # write to an output file.
    # no need to do anything with i and j after this.
    pass
Is there an alternative to zip() that behaves like a generator and lets me iterate over these two files without using more than 200 GB of RAM?
Answer 0 (score: 22)
from itertools import izip
for i, j in izip(handle1, handle2):
    ...
If the files have different numbers of lines, you can use izip_longest, because izip stops as soon as the shorter file runs out.
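For readers on Python 3, where itertools.izip no longer exists and the built-in zip is already lazy, the same idea can be sketched as below. The helper name merge_line_pairs and the tab-joined output format are assumptions made for illustration; they are not part of the question.

```python
import os
import tempfile


def merge_line_pairs(path_a, path_b, path_out):
    """Stream two files in lockstep and write one combined line per pair.

    Returns the number of pairs written. Stops at the shorter file.
    """
    count = 0
    with open(path_a) as fa, open(path_b) as fb, open(path_out, "w") as out:
        # In Python 3, zip() pulls one line from each file per iteration,
        # so only the current pair of lines is held in memory.
        for a, b in zip(fa, fb):
            # Hypothetical "do something": join the two lines with a tab.
            out.write(a.rstrip("\n") + "\t" + b.rstrip("\n") + "\n")
            count += 1
    return count


# Tiny demonstration with throwaway files (the real inputs would be ~100 GB).
workdir = tempfile.mkdtemp()
path_a = os.path.join(workdir, "filea")
path_b = os.path.join(workdir, "fileb")
path_out = os.path.join(workdir, "out")
with open(path_a, "w") as f:
    f.write("a1\na2\na3\n")
with open(path_b, "w") as f:
    f.write("b1\nb2\n")
pairs_written = merge_line_pairs(path_a, path_b, path_out)
```

Because the third line of filea has no partner, it is silently dropped here, which is the truncating behaviour described above.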
Answer 1 (score: 15)
You can use izip_longest like this to pad the shorter file with empty lines.
In Python 2.6:
from itertools import izip_longest
# The syntax is "with <expression> as <name>", not the other way round.
with open('filea', 'r') as handle1:
    with open('fileb', 'r') as handle2:
        for i, j in izip_longest(handle1, handle2, fillvalue=""):
            ...
Or in Python 3+:
from itertools import zip_longest
# The syntax is "with <expression> as <name>", not the other way round.
with open('filea', 'r') as handle1, open('fileb', 'r') as handle2:
    for i, j in zip_longest(handle1, handle2, fillvalue=""):
        ...
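To make the padding behaviour concrete, here is a small Python 3 sketch that uses in-memory io.StringIO objects in place of the real files. The wrapper name paired_lines is hypothetical, and the call to list() is for the tiny demonstration only; on the real files you would loop over the generator directly.

```python
import io
from itertools import zip_longest


def paired_lines(file_a, file_b, fill=""):
    """Yield (line_a, line_b) lazily, padding the shorter input with `fill`."""
    for a, b in zip_longest(file_a, file_b, fillvalue=fill):
        # Strip the trailing newline so padded and real lines look alike.
        yield a.rstrip("\n"), b.rstrip("\n")


# Stand-ins for the two files: the first is one line shorter.
short_file = io.StringIO("a1\na2\n")
long_file = io.StringIO("b1\nb2\nb3\n")

pairs = list(paired_lines(short_file, long_file))
```

If you need to tell padding apart from a genuinely blank line, check before stripping: a real blank line read from a file is "\n", whereas the fill value is "".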
Answer 2 (score: 0)
If you want to truncate to the shorter file:
handle1 = open('filea', 'r')
handle2 = open('fileb', 'r')
try:
    while 1:
        # next(f) works in Python 2.6+ and Python 3; f.next() is Python 2 only.
        i = next(handle1)
        j = next(handle2)
        # do something with i and j.
        # write to an output file.
except StopIteration:
    # Either file ran out of lines: stop.
    pass
finally:
    handle1.close()
    handle2.close()
Otherwise:
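handle1 = open('filea', 'r')
handle2 = open('fileb', 'r')
i_ended = False
j_ended = False
while 1:
    try:
        i = next(handle1)
    except StopIteration:
        i_ended = True
        i = ''  # pad the exhausted file instead of reusing a stale line
    try:
        j = next(handle2)
    except StopIteration:
        j_ended = True
        j = ''  # pad the exhausted file instead of reusing a stale line
    if i_ended and j_ended:
        break  # check before processing so no empty pair is handled
    # do something with i and j.
    # write to an output file.
handle1.close()
handle2.close()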
Or:
handle1 = open('filea', 'r')
handle2 = open('fileb', 'r')
while 1:
    # readline() returns '' at end of file; a blank line comes back as '\n'.
    i = handle1.readline()
    j = handle2.readline()
    if not i and not j:
        break  # both files are exhausted
    # do something with i and j.
    # write to an output file.
handle1.close()
handle2.close()
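The non-truncating loops above can also be written more compactly with the two-argument form of the built-in next(), which returns a default instead of raising StopIteration. This is a sketch of that variant, not the answer's own code; the generator name read_in_lockstep is hypothetical, and io.StringIO objects stand in for the real files.

```python
import io


def read_in_lockstep(file_a, file_b):
    """Yield line pairs until both files are exhausted.

    An empty string marks a file that has already run out. This is a safe
    sentinel because lines read from a file always contain at least "\n",
    except possibly a non-empty final line.
    """
    while True:
        a = next(file_a, "")
        b = next(file_b, "")
        if not a and not b:
            break
        yield a, b


# Stand-ins for the two files: the first is one line shorter.
first = io.StringIO("x\n")
second = io.StringIO("y\nz\n")

result = list(read_in_lockstep(first, second))  # list() only for the demo
```

Only one pair of lines is alive at a time, so memory use stays flat regardless of file size.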
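Answer 3 (score: -1)
Something like this? It's wordy, but it seems to be what you're asking for.
It can be adapted to do a proper merge that matches keys between the two files, which is often what's needed rather than a simplistic zip. Also, this doesn't truncate, which is how the SQL OUTER JOIN algorithm behaves; again, that differs from zip and is more typical of file processing.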
def parallel(file1, file2):
    if1_more, if2_more = True, True
    while if1_more or if2_more:
        line1, line2 = None, None  # Assume simplistic zip-style matching
        # If you're going to compare keys, then you'd do that before
        # deciding what to read.
        if if1_more:
            try:
                line1 = next(file1)
            except StopIteration:
                if1_more = False
        if if2_more:
            try:
                line2 = next(file2)
            except StopIteration:
                if2_more = False
        if if1_more or if2_more:
            # Skip the final (None, None) pair once both files are exhausted.
            yield line1, line2

# The generator must be defined before it is used.
with open("file1", "r") as file1:
    with open("file2", "r") as file2:
        for line1, line2 in parallel(file1, file2):
            pass  # process lines
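The key-matching merge mentioned in this answer could look roughly like the sketch below. It rests on assumptions that are not in the original: each line has the form key<TAB>value, both files are sorted ascending by key (compared as strings), and a key appears at most once per file. The names key_of and outer_join are hypothetical.

```python
import io


def key_of(line):
    """Hypothetical key extractor: the text before the first tab."""
    return line.split("\t", 1)[0].rstrip("\n")


def outer_join(file_a, file_b):
    """Yield (line_a, line_b) pairs matched by key; None marks a missing side.

    Assumes both inputs are sorted ascending by key with unique keys per file.
    """
    a = next(file_a, None)
    b = next(file_b, None)
    while a is not None or b is not None:
        if b is None or (a is not None and key_of(a) < key_of(b)):
            # Key exists only on the left: emit it and advance the left file.
            yield a, None
            a = next(file_a, None)
        elif a is None or key_of(b) < key_of(a):
            # Key exists only on the right: emit it and advance the right file.
            yield None, b
            b = next(file_b, None)
        else:
            # Keys match: emit both lines and advance both files.
            yield a, b
            a = next(file_a, None)
            b = next(file_b, None)


# Stand-ins for two sorted files.
left = io.StringIO("1\ta\n2\tb\n4\td\n")
right = io.StringIO("2\tB\n3\tC\n4\tD\n")

joined = list(outer_join(left, right))  # list() only for the demo
```

Unlike the zip-style pairing, the decision about which file to read from is made after comparing keys, so each file advances independently while still holding only one line per file in memory.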