Matching lines in two text files

Time: 2015-02-04 09:59:30

Tags: python

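I have two text files: the first is 40 GB (data2) and the second is about 50 MB (data1). I want to check whether any line in file1 matches a line in file2, so I wrote the Python script below. It takes far too long, because for each line of file1 it scans the whole of file2 line by line.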

for line in open("data1.txt","r"):
    for line2 in open("data2.txt","r"):
        if line==line2:
            print(line)

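Is there any approach or code that would make this faster? The script has been running for 5 days and still has not finished. Is there also a way to show a percentage or the current line number while it runs?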

1 answer:

Answer 0 (score: 4)

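Use a set and reverse the logic: load the lines of the smaller 50 MB file (data1.txt) into a set, then check whether each line of the large data file is in that set. This reads the large file once instead of once per line of the small file: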

with open("data1.txt", "r") as f1, open("data2.txt", "r") as f2:
    lines = set(f1)  # efficient O(1) lookups using a set
    for line in f2:  # single pass over large file
        if line in lines:
            print(line)

If you want line numbers, use enumerate:

with open("data1.txt", "r") as f1, open("data2.txt", "r") as f2:
    lines = set(f1)  # efficient O(1) lookups using a set
    for line_no, line in enumerate(f2, 1):  # single pass over large file
        # print(line_no)  # uncomment if you want to see every line number
        if line in lines:
            print(line, line_no)
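
The question also asks for a percentage. One possible way, as a rough sketch rather than a definitive solution, is to compare the number of bytes read so far against the size of the large file. The function name `match_with_progress` and the `report_every` parameter are made up for this illustration. The files are opened in binary mode on the assumption that an exact byte-for-byte comparison is acceptable, so that `len(line)` is the exact number of bytes consumed:

```python
import os
import sys


def match_with_progress(small_path, large_path, report_every=1_000_000):
    """Yield (line_no, line) for each line of large_path found in small_path.

    Hypothetical helper: progress is written to stderr as a percentage
    of the bytes read from the large file.
    """
    # "or 1" avoids a division by zero if the large file is empty.
    total = os.path.getsize(large_path) or 1
    # Binary mode: lines are bytes, so len(line) counts bytes exactly.
    with open(small_path, "rb") as small, open(large_path, "rb") as large:
        wanted = set(small)  # small file fits in memory; O(1) average lookups
        read = 0
        for line_no, line in enumerate(large, 1):  # single pass over large file
            read += len(line)
            if line in wanted:
                yield line_no, line
            if line_no % report_every == 0:
                pct = 100 * read / total
                print(f"line {line_no}: {pct:.1f}%", file=sys.stderr)


if __name__ == "__main__":
    # Only runs if both files from the question are present.
    if os.path.exists("data1.txt") and os.path.exists("data2.txt"):
        for line_no, line in match_with_progress("data1.txt", "data2.txt"):
            print(line_no, line.decode(errors="replace"), end="")
```

Progress goes to stderr so it does not mix with the matched lines on stdout. Note that lines are compared including their line endings, so a final line without a trailing newline, or a file with different line endings, will not match unless the lines are normalized first.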