Question

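I have a large text file, largeFile, with 2M entries, and another text file with fewer than 1M entries.

All entries in the smaller file, File2, are also present in File1.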

Entries in the larger file have this format:

helloworld_12345_987654312.zip
helloWorld_12344_987654313.zip
helloWOrld_12346_987654314.zip

The smaller file contains data such as:

987654312
987654313

that is, the last part of the file name before the .zip extension. Can anyone give me pointers on how I can achieve this?

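My attempt was to loop over the smaller file, grep for each entry in the larger file, and delete the entry from the larger file whenever it is found, so that at the end of the process only the missing entries are left in the file.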

Although this solution works, it is inefficient and crude. Can someone suggest a better approach to this problem?

Answer 1

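grep has a switch, -f, which reads patterns from a file. Combine it with -v, which prints only the lines that do not match, and you have an elegant solution. Since your patterns are fixed strings, you can improve performance dramatically by also using -F.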

grep -F -v -f smallfile bigfile
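
If the IDs should be compared only against the final field of each file name rather than anywhere in the line, a set lookup on the extracted field is one alternative to grep. The following is a minimal sketch, not part of the original answer: the function names extract_id and missing_entries are hypothetical, and it assumes every entry looks like name_number_ID.zip.

```python
def extract_id(entry):
    """Return the text between the last underscore and the final extension."""
    stem = entry.rsplit(".", 1)[0]   # drop the ".zip" extension
    return stem.rsplit("_", 1)[-1]   # keep the last underscore-separated field


def missing_entries(big_lines, small_lines):
    """Yield entries from big_lines whose ID is not listed in small_lines."""
    # A set gives constant-time membership checks on average.
    known_ids = {line.strip() for line in small_lines if line.strip()}
    for line in big_lines:
        entry = line.strip()
        if entry and extract_id(entry) not in known_ids:
            yield entry


# Sample data taken from the question.
big = [
    "helloworld_12345_987654312.zip",
    "helloWorld_12344_987654313.zip",
    "helloWOrld_12346_987654314.zip",
]
small = ["987654312", "987654313"]

result = list(missing_entries(big, small))
print(result)
```

Both arguments can be any iterable of lines, so open file objects for the two files should work in place of the sample lists.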

I wrote a Python script to generate some test data:

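# Generate test data: bigfile holds every entry, smallfile holds only the odd IDs.
count = 2000000
start = 1000000

# The with statement closes both files even if a write fails.
with open('bigfile', 'w') as bigfile, open('smallfile', 'w') as smallfile:
    for i in range(start, start + count):
        bigfile.write('foo' + str(i) + 'bar\n')
        if i % 2:  # odd IDs also go into the small file
            smallfile.write(str(i) + '\n')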

Here are some tests I ran with only 2000 lines (count set to 2000), because with more lines the time needed to run grep without -F becomes ridiculous.

$ time grep -v -f smallfile bigfile > /dev/null

real    0m3.075s
user    0m2.996s
sys     0m0.028s

$ time grep -F -v -f smallfile bigfile > /dev/null

real    0m0.011s
user    0m0.000s
sys     0m0.012s

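grep also has a --mmap switch which, according to the man page, may improve performance. In my tests there was no performance gain.

For these tests I used 2 million lines.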

$ time grep -F -v -f smallfile bigfile > /dev/null

real    0m3.900s
user    0m3.736s
sys     0m0.104s

$ time grep -F --mmap -v -f smallfile bigfile > /dev/null

real    0m3.911s
user    0m3.728s
sys     0m0.128s

Answer 2

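Use grep. You can specify the smaller file as the file to read patterns from (with -f filename) and add -v to get the lines that do not match any pattern.

Since your patterns appear to be fixed strings, you can also supply the -F option, which speeds up grep.

The following should be self-explanatory: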

$ cat big 
helloworld_12345_987654312.zip
helloWorld_12344_987654313.zip
helloWOrld_12346_987654314.zip
$ cat small 
987654312
987654313
$ grep -F -f small big      # Find lines matching those in the smaller file
helloworld_12345_987654312.zip
helloWorld_12344_987654313.zip
$ grep -F -v -f small big   # Eliminate lines matching those in the smaller file
helloWOrld_12346_987654314.zip
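
One thing to keep in mind is that -F matches each pattern as a substring anywhere in the line, so an ID that also happens to appear in another field would match too. The sketch below mimics that behaviour in Python and contrasts it with comparing only the last field. The helper names and the sample pattern 12346 are hypothetical, chosen only to illustrate the difference.

```python
def fixed_string_filter(lines, patterns):
    """Keep lines containing none of the patterns as a substring
    (a rough analogue of grep -F -v -f, not a reimplementation)."""
    return [line for line in lines if not any(p in line for p in patterns)]


def last_field_filter(lines, patterns):
    """Keep lines whose last underscore-separated field, taken before
    the extension, is not one of the patterns."""
    wanted = set(patterns)
    return [
        line for line in lines
        if line.rsplit(".", 1)[0].rsplit("_", 1)[-1] not in wanted
    ]


lines = [
    "helloworld_12345_987654312.zip",
    "helloWOrld_12346_987654314.zip",
]
# Hypothetical patterns: "12346" occurs only as a middle field.
patterns = ["987654312", "12346"]

substring_result = fixed_string_filter(lines, patterns)
field_result = last_field_filter(lines, patterns)
print(substring_result)  # the second line is dropped because "12346" occurs in it
print(field_result)      # the second line is kept: its last field is 987654314
```

With the data shown in the question this distinction does not matter, since the IDs in the smaller file appear only in the last field.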

Generating a diff between two asymmetric files in Bash
