Question

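I have two large files (sets of filenames), each about 30,000 lines long. I am trying to find a fast way to find the lines in file2 whose left-hand side (the part before the "=") does not appear in file1.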

For example, if this is file1:

A=1
B=2
C=3

and this is file2:

A=10
B=20
C=30
D=5

then my result/output should be:

D=5

because file1 has no D=<something> line.

Answer 1

You can read file1 into a list, then read file2 and check whether each entry is in that list.

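file1 = "file1"  # path to the first file (adjust as needed)
file2 = "file2"  # path to the second file

# Collect the key (the text before the first '=') of every line in file1.
filelist1 = []
with open(file1, 'r') as f:
    for line in f:
        filelist1.append(line.split('=')[0])

# Print every line of file2 whose key does not appear in file1.
with open(file2, 'r') as f:
    for line in f:
        if line.split('=')[0] not in filelist1:
            print(line, end='')  # line already ends with a newline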

That should work.
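One thing to keep in mind with the list version: every `not in` test scans the list, so with 30,000 lines per file the loop can end up doing a very large number of comparisons. The rough sketch below compares a list against a set for the same check. The keys (K0, K1, ...) and the size n are made up for illustration, and n is kept small so it finishes quickly; the timings will vary from machine to machine.

```python
import timeit

# Hypothetical keys standing in for the left-hand sides found in file1 and file2.
n = 2000
keys1 = [f"K{i}" for i in range(n)]
keys2 = [f"K{i}" for i in range(n // 2, n + n // 2)]


def missing_with_list(known, candidates):
    # Each `not in` test walks the list until it finds a match or reaches the end.
    return [k for k in candidates if k not in known]


def missing_with_set(known, candidates):
    # Build the set once; each membership test is then a hash lookup.
    known_set = set(known)
    return [k for k in candidates if k not in known_set]


list_time = timeit.timeit(lambda: missing_with_list(keys1, keys2), number=3)
set_time = timeit.timeit(lambda: missing_with_set(keys1, keys2), number=3)
print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")
```

Both functions return the same keys; only the cost of the membership test differs.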

Answer 2

This works for the given example and is very simple:

grep -v -F "$(grep -o '.*=' file1)" file2

I tried it on an artificially created 30,000-line file, and it was fast.

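It just uses grep -o to build a list of matches (everything up to and including the "=" on each line of file1) and feeds that list to grep -F as fixed strings. The -v then means "show the lines that do not match".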

Some caveats:

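This is case-sensitive, so A=10 is not the same as a=10.
It assumes there is only one "=" sign on any line of file1, and that everything to its left (including whitespace) is part of the check.
Possibly a bug: if file1 contains A=10 and file2 contains AAA=10, the pattern A= is found inside AAA=10, so that line is not reported. I will try to rewrite the one-liner to fix this.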
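The last caveat can be reproduced in a few lines of Python. This is only a sketch that imitates what the two grep calls do, using in-memory lists; the sample lines are the ones from the caveat plus D=5 from the question, and the variable names are made up for illustration.

```python
file1_lines = ["A=10"]
file2_lines = ["AAA=10", "D=5"]

# Roughly what `grep -o '.*='` extracts from file1:
# everything up to and including the last "=" on each line.
patterns = [line[: line.rindex("=") + 1] for line in file1_lines if "=" in line]

# Roughly what `grep -v -F` does: keep the lines of file2 that contain
# none of the patterns anywhere in the line (substring match).
substring_result = [
    line for line in file2_lines if not any(p in line for p in patterns)
]

# For comparison: match on the whole key (the text before the first "=").
keys1 = {line.split("=")[0] for line in file1_lines}
exact_result = [line for line in file2_lines if line.split("=")[0] not in keys1]

print(substring_result)  # ['D=5']
print(exact_result)      # ['AAA=10', 'D=5']
```

The substring match drops AAA=10 because "A=" occurs inside it, while the whole-key comparison keeps it.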

Another option, which is simpler and actually better:

join -t= -v 2 <(sort file1) <(sort file2)

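This one requires file1 and file2 to be sorted first, but it does not have the bug that the grep version above has. It may also be faster (I haven't actually checked). The other caveats above still apply.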
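For readers less familiar with join, the idea behind `-v 2` (print the lines of the second input that have no partner in the first) can be sketched in Python as a merge over two sorted lists. This is an illustration of the idea only, not an emulation of join: the function name is hypothetical, it sorts by the key rather than by the whole line, and it uses Python's string ordering instead of the locale's collation order.

```python
def key_of(line):
    # The join field: the text before the first "=".
    return line.split("=", 1)[0]


def unpaired_from_second(lines1, lines2):
    """Return the lines of lines2 whose key does not occur in lines1."""
    a = sorted(lines1, key=key_of)
    b = sorted(lines2, key=key_of)
    result = []
    i = 0
    for line in b:
        k = key_of(line)
        # Advance through the first list until its key is >= the current key.
        while i < len(a) and key_of(a[i]) < k:
            i += 1
        # No partner if the first list is exhausted or its key differs.
        if i == len(a) or key_of(a[i]) != k:
            result.append(line)
    return result


print(unpaired_from_second(["A=1", "B=2", "C=3"], ["A=10", "B=20", "C=30", "D=5"]))
# ['D=5']
```

Because whole keys are compared, AAA=10 is not mistaken for a match of A=10 here.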

Answer 3

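Something like this will work, but it only gives you "D" rather than the full line; whether that is enough depends on what you need downstream and on what your files actually look like. Also, as Aaron said, it should be fast.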

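with open("path/to/file1") as f1, open("path/to/file2") as f2:
    # Keys (the text before the first "=") present in each file.
    s1 = {x.split("=")[0] for x in f1}
    s2 = {x.split("=")[0] for x in f2}
    # Keys that occur in file2 but not in file1.
    result = s2 - s1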

If you need the whole line:

with open("path/to/file1") as f1, open("path/to/file2") as f2:
    s1 = {x.split("=")[0] for x in f1}
    # Keep each file2 line (trailing newline included) whose key is not in file1.
    result = [x for x in f2 if x.split("=")[0] not in s1]
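To try the whole-line variant end to end, here is a small self-contained sketch that writes the two sample files from the question to a temporary directory and runs the comparison. The function name is hypothetical, and two choices are assumptions rather than part of the answer: lines without an "=" are skipped, and the trailing newline is stripped from the returned lines.

```python
import os
import tempfile


def lines_missing_from_first(path1, path2):
    """Return the lines of path2 whose key (text before '=') is absent from path1."""
    with open(path1) as f1:
        # Assumption: lines without "=" carry no key and are ignored.
        keys1 = {line.partition("=")[0] for line in f1 if "=" in line}
    with open(path2) as f2:
        return [
            line.rstrip("\n")
            for line in f2
            if "=" in line and line.partition("=")[0] not in keys1
        ]


# Recreate the example from the question in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path1 = os.path.join(tmp, "file1")
    path2 = os.path.join(tmp, "file2")
    with open(path1, "w") as f:
        f.write("A=1\nB=2\nC=3\n")
    with open(path2, "w") as f:
        f.write("A=10\nB=20\nC=30\nD=5\n")
    result = lines_missing_from_first(path1, path2)

print(result)  # ['D=5']
```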

Fast way to compare lines in 2 files?
