Question

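I have two 3GB text files, each with around 80 million lines. They share 99.9% identical lines (file A has 60,000 unique lines, file B has 80,000 unique lines).

How can I quickly find those unique lines in the two files? Are there any ready-to-use command-line tools for this? I'm using Python, but I suspect I'm unlikely to find an efficient Pythonic way to load the files and compare them.

Any suggestions are appreciated.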

Answer 1

If order matters, try the comm utility (note that comm expects both inputs to be sorted). If order doesn't matter, sort file1 file2 | uniq -u.
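For anyone who wants the same idea from Python, here is a minimal sketch of a comm-style merge. It is not what comm itself does internally, just an illustration; the function name unique_lines_sorted is made up, and it assumes both inputs are already sorted in the same order that Python uses for comparison.

```python
def unique_lines_sorted(a_lines, b_lines):
    """Yield ('A', line) or ('B', line) for lines found in only one input.

    Assumption: both iterables are already sorted in Python's comparison
    order. Only one line per input is held in memory at a time.
    """
    a_iter, b_iter = iter(a_lines), iter(b_lines)
    a = next(a_iter, None)
    b = next(b_iter, None)
    while a is not None and b is not None:
        if a == b:
            # Same line in both inputs: skip it on both sides.
            a = next(a_iter, None)
            b = next(b_iter, None)
        elif a < b:
            # a sorts before b, so a cannot appear later in B.
            yield ('A', a)
            a = next(a_iter, None)
        else:
            yield ('B', b)
            b = next(b_iter, None)
    # Whatever is left over on either side is unique to that side.
    while a is not None:
        yield ('A', a)
        a = next(a_iter, None)
    while b is not None:
        yield ('B', b)
        b = next(b_iter, None)
```

To run it on real files, one option is to sort them first with LC_ALL=C sort and pass two file objects opened in binary mode, so that the byte-wise order of sort matches the comparison done here.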

Answer 2

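I think this is the fastest approach (whether it's written in Python or another language shouldn't matter too much, IMO).

Notes:

1. I only store the hash of each line to save space (and the time that paging might cost).

2. Because of the above, I only print out line numbers; if you need the actual lines, you just need to read the files again.

3. I assume that the hash function produces no collisions. This is nearly, but not perfectly, certain.

4. I import hashlib because the built-in hash() function is too short to avoid collisions.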

import sys
import hashlib

# lines[i] maps the hash of each line in file i to a "filename: line number" label
lines = []
for path in sys.argv[1:3]:
    hashes = {}
    # open the files named on the command line
    # assuming default encoding is sufficient to handle the input file
    with open(path, 'r') as f:
        # assuming you like counting lines starting with 1
        for counter, line in enumerate(f, 1):
            hashcode = hashlib.sha512(line.encode()).hexdigest()
            hashes[hashcode] = path + ': ' + str(counter)
    lines.append(hashes)

# hashes that occur in one file but not in the other
unique0 = lines[0].keys() - lines[1].keys()
unique1 = lines[1].keys() - lines[0].keys()
result = [lines[0][x] for x in unique0] + [lines[1][x] for x in unique1]
for entry in result:
    print(entry)
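Regarding note 2, the second pass to recover the actual text could look roughly like this. It is only a sketch: fetch_lines is a hypothetical helper name, and it assumes you have already collected the 1-based line numbers reported for one file.

```python
def fetch_lines(path, wanted_numbers):
    """Return {line_number: text} for the given 1-based line numbers.

    Hypothetical helper: reads the file once more and keeps only the
    requested lines, so memory use depends on how many lines are wanted.
    """
    wanted = set(wanted_numbers)
    found = {}
    with open(path, 'r') as f:
        for number, line in enumerate(f, 1):
            if number in wanted:
                found[number] = line.rstrip('\n')
                # Stop early once every requested line has been seen.
                if len(found) == len(wanted):
                    break
    return found
```

With only 60,000 to 80,000 differing lines per file, the returned dictionary should stay small even though the file itself is large.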

Answer 3

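With 60,000 or 80,000 unique lines, you could just create a dictionary that maps each unique line to a number: mydict["hello world"] => 1, and so on. If your average line is around 40-80 characters, this will be in the neighborhood of 10 MB of memory.

Then read each file, converting it to an array of numbers via the dictionary. These will fit easily in memory (2 files of 8 bytes * 3GB / 60k lines is less than 1 MB of memory). Then diff the lists. You could invert the dictionary and use it to print out the text of the lines that differ.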
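A rough sketch of those steps, under a few assumptions: the helper names number_lines and diff_by_ids are invented for illustration, the "diff" is done as a plain set difference on the numbers (order is ignored), and each number is stored as an 8-byte unsigned integer via the array module.

```python
from array import array


def number_lines(lines, table):
    """Convert lines to 8-byte integer ids, adding unseen lines to table."""
    ids = array('Q')
    for line in lines:
        if line not in table:
            # First time this text is seen: assign the next free number.
            table[line] = len(table) + 1
        ids.append(table[line])
    return ids


def diff_by_ids(lines_a, lines_b):
    """Return (texts only in A, texts only in B), compared through their ids."""
    table = {}
    ids_a = number_lines(lines_a, table)
    ids_b = number_lines(lines_b, table)
    # Invert the dictionary so numbers can be turned back into text.
    inverse = {number: text for text, number in table.items()}
    only_a = sorted(set(ids_a) - set(ids_b))
    only_b = sorted(set(ids_b) - set(ids_a))
    return [inverse[n] for n in only_a], [inverse[n] for n in only_b]
```

Passing two open file objects as lines_a and lines_b works, since the function only iterates over them; the results come back in order of first appearance.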

EDIT

In response to your comment, here is a sample script that assigns numbers to unique lines as they are read from a file.

#!/usr/bin/env python3

import sys


class Reader:
    """Assigns a number to each distinct line as it is read from a file."""

    def __init__(self, file):
        self.count = 0
        self.dict = {}
        self.file = file

    def readline(self):
        line = self.file.readline()
        if not line:
            # end of input
            return None
        if line not in self.dict:
            # first time this line is seen: give it the next number
            self.count = self.count + 1
            self.dict[line] = self.count
        return self.dict[line]


if __name__ == '__main__':
    print("Type Ctrl-D to quit.")
    r = Reader(sys.stdin)
    result = r.readline()
    while result:
        print(result)
        result = r.readline()

Answer 4

If I understand correctly, you want the lines of these files without duplicates. This does the job:

uniqA = set(open('fileA', 'r'))
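Extending that to the original question might look like the sketch below. The function name unique_to_each is made up, and the big assumption is that both sets fit in memory, which is far from guaranteed for two 3GB files.

```python
def unique_to_each(path_a, path_b):
    """Return (lines only in A, lines only in B) as two sets.

    Assumption: both files fit in memory as sets of strings.
    Trailing newlines are stripped so a final line without a newline
    still compares equal to the same text in the other file.
    """
    with open(path_a, 'r') as fa:
        set_a = {line.rstrip('\n') for line in fa}
    with open(path_b, 'r') as fb:
        set_b = {line.rstrip('\n') for line in fb}
    return set_a - set_b, set_b - set_a
```

Note that sets discard both line order and repeated lines, so this only tells you which texts differ, not where they occur.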

Answer 5

http://www.emeditor.com/ can handle large files and can also compare them.

Answer 6

Python has difflib, which claims to be quite competitive with other diff utilities; see: http://docs.python.org/library/difflib.html
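As a small illustration of what using difflib could look like, here is a sketch based on difflib.SequenceMatcher. The wrapper name changed_lines is hypothetical, and this has only been thought through for small inputs; whether it is fast enough for 80 million lines is an open question.

```python
import difflib


def changed_lines(a_lines, b_lines):
    """Return (lines removed from A, lines added in B) using difflib."""
    # autojunk=False turns off the heuristic that treats very frequent
    # lines as junk in longer sequences.
    matcher = difflib.SequenceMatcher(None, a_lines, b_lines, autojunk=False)
    removed, added = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != 'equal':
            # 'replace', 'delete' and 'insert' all reduce to these two slices.
            removed.extend(a_lines[i1:i2])
            added.extend(b_lines[j1:j2])
    return removed, added
```

Both arguments need to be sequences such as lists, so the files would have to be read fully into memory first.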

Quickly finding the differences between two large text files
