Question

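I have two text files: the first is 40 GB (data2) and the second is around 50 MB (data1). I want to check whether any line in data1 also appears in data2, so I wrote the Python script below. It takes far too long, because for every line of data1 it reads through the whole of data2 line by line.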

for line in open("data1.txt","r"):
    for line2 in open("data2.txt","r"):
        if line==line2:
            print(line)

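Is there any way (or code) to make this faster? The script has been running for 5 days and still has not finished. Is there also a way to show a percentage or the current line number while it runs?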

Answer 1

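Use a set and reverse the logic: make a single pass over the large data file and check whether each of its lines is in a set built from the lines of f1, the smaller 50 MB file: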

with open("data1.txt", "r") as f1, open("data2.txt", "r") as f2:
    lines = set(f1)  # small file held in memory; set lookups are O(1) on average
    for line in f2:  # single pass over the large file
        if line in lines:
            print(line, end="")  # line already ends with a newline
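
One caveat with comparing raw lines: each line includes its trailing newline, so a final line that lacks one will not match the same text followed by a newline in the other file. A minimal sketch that strips line endings before comparing; the function name common_lines is hypothetical and not part of the original answer:

```python
def common_lines(small_path, large_path):
    """Yield lines of large_path that also occur in small_path, ignoring line endings."""
    with open(small_path, "r") as small:
        # Store the small file's lines without their trailing newline characters.
        wanted = {line.rstrip("\r\n") for line in small}
    with open(large_path, "r") as large:
        for line in large:  # single pass over the large file
            stripped = line.rstrip("\r\n")
            if stripped in wanted:
                yield stripped
```

Because it is a generator, matches are produced as the large file is read, so nothing beyond the small file's lines needs to fit in memory.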

If you want line numbers, use enumerate:

with open("data1.txt", "r") as f1, open("data2.txt", "r") as f2:
    lines = set(f1)  # small file held in memory; set lookups are O(1) on average
    for line_no, line in enumerate(f2, 1):  # single pass over the large file
        # print(line_no)  # uncomment if you want to see every line number
        if line in lines:
            print(line_no, line, end="")  # line already ends with a newline
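
For the percentage the question asks about, one option is to compare the number of bytes read so far with the size of the large file. The sketch below is only an illustration under stated assumptions: the function name match_with_progress and the report_every parameter are made up for this example, and both files are opened in binary mode so that len(line) is an exact byte count.

```python
import os
import sys


def match_with_progress(small_path, large_path, report_every=1_000_000):
    """Yield (line_no, line) for each line of large_path that occurs in small_path.

    Writes an approximate percentage to stderr every report_every lines.
    """
    total_bytes = os.path.getsize(large_path)
    # Binary mode: lines are bytes objects, so len(line) counts bytes exactly.
    with open(small_path, "rb") as small:
        wanted = set(small)
    bytes_read = 0
    with open(large_path, "rb") as large:
        for line_no, line in enumerate(large, 1):
            bytes_read += len(line)
            if line_no % report_every == 0:
                percent = 100.0 * bytes_read / total_bytes
                print(f"{percent:.1f}% (line {line_no})", file=sys.stderr)
            if line in wanted:
                yield line_no, line
```

Looping over match_with_progress("data1.txt", "data2.txt") yields the matching line numbers and lines (as bytes) while the progress messages go to stderr, so they stay separate from the matches if the output is redirected to a file.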
