I am trying to compute the difference between two large CSV files (~4 GB) to get the newly added rows and write them to an output CSV file. I can do this for relatively small files (~50 MB) with the following code.
input_file1 = "data.csv"
input_file2 = "data_1.csv"
output_path = "out.csv"

with open(input_file1, 'r') as t1, open(input_file2, 'r') as t2:
    fileone = t1.readlines()
    filetwo = t2.readlines()

with open(output_path, 'w') as outFile:
    for line in filetwo:
        if line not in fileone:
            outFile.write(line)
However, for the larger files the code above is either far too slow (it runs for about half an hour) or crashes for lack of memory.

Is there a faster way to get the difference between two large CSV files?
Answer 0 (score: 3)
For starters, note the typo here:

with open(input_file2, 'r') as t1, open(input_file2, 'r') as t2:

You are reading the same file (input_file2) twice.

Next, you don't have to read the second file into memory at all; just iterate over it line by line.

For speed, simply build a set from the first file (fast lookups, and it saves memory if there are duplicate lines). For that, you have to keep the second file open while writing the result:
input_file1 = "data.csv"
input_file2 = "data_1.csv"
output_path = "out.csv"

with open(input_file1, 'r') as t1:
    fileone = set(t1)

with open(input_file2, 'r') as t2, open(output_path, 'w') as outFile:
    for line in t2:
        if line not in fileone:
            outFile.write(line)
for line in t2 reads the file line by line (always avoid readlines() when you can), so the memory footprint stays small even for a big file. fileone does need some memory, yes, but hopefully less if the file is smaller and/or has duplicate lines, and certainly no more than readlines() would use. if line not in fileone probably looks the same as before, but it now has average O(1) complexity, which makes the program much faster.

Answer 1 (score: 2)
You could use a database, or a sort-merge. I'll give you the basic algorithm (rather than Python code).

The idea is to sort the two files into the same order, then read through both files in step:
Sort the 2 files to new SortedFiles using the Operating System's sort
(use the whole record as the sort key)

Open/Read SortedOldFile
Open/Read SortedNewFile

while (not end-of-file-SortedOldFile) and (not end-of-file-SortedNewFile):
    if SortedOldFile.record < SortedNewFile.record:
        ## Deleted processing goes here
        read SortedOldFile
    elseif SortedOldFile.record > SortedNewFile.record:
        ## Insert processing goes here
        read SortedNewFile
    else:
        read SortedOldFile
        read SortedNewFile

while (not end-of-file-SortedOldFile):
    ## Deleted processing
    read SortedOldFile

while (not end-of-file-SortedNewFile):
    ## Insert processing
    read SortedNewFile
Advantages:

Disadvantages:
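The pseudocode above can be sketched in Python. This is a minimal illustration, not from the original answer: it assumes both files have already been sorted externally (e.g. with the Unix sort command) and handles only the "insert" branch the question asks about, holding just one line per file in memory at a time. The function name and file paths are made up for the example.

```python
def diff_sorted(old_path, new_path, out_path):
    """Write lines that appear in new_path but not in old_path.

    Both input files must already be sorted line-by-line.
    """
    with open(old_path) as old_f, open(new_path) as new_f, \
         open(out_path, "w") as out:
        old_line = old_f.readline()
        new_line = new_f.readline()
        while old_line and new_line:
            if old_line < new_line:
                # Only in the old file: a deleted line, skip it.
                old_line = old_f.readline()
            elif old_line > new_line:
                # Only in the new file: an inserted line, keep it.
                out.write(new_line)
                new_line = new_f.readline()
            else:
                # Present in both files: advance both.
                old_line = old_f.readline()
                new_line = new_f.readline()
        while new_line:
            # Anything left in the new file is an insert.
            out.write(new_line)
            new_line = new_f.readline()
```

Because only two lines are ever held in memory, this works for files of any size; the cost is the external sort step up front.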
Answer 2 (score: 0)
You can hash the lines to compress them into a smaller set for comparison, or use a more advanced algorithm to compute fingerprints:

https://en.wikipedia.org/wiki/Fingerprint_(computing)
import hashlib

input_file1 = "data.csv"
input_file2 = "data_1.csv"
output_path = "out.csv"

def get_data(file_):
    # Map each line's MD5 digest to the line numbers where it occurs.
    res = {}
    for i, line in enumerate(file_):
        # Use a fresh hash object per line: reusing one via update()
        # would hash the concatenation of all lines read so far.
        hashed_line = hashlib.md5(line.encode()).hexdigest()
        res.setdefault(hashed_line, []).append(i)
    return res

with open(input_file1, 'r') as t1, open(input_file2, 'r') as t2:
    file1_data = get_data(t1)
    # Read file2 into a list first, then hash the list; iterating
    # the file twice would find it already exhausted.
    file2_raw = t2.readlines()
    file2_data = get_data(file2_raw)

with open(output_path, 'w') as outFile:
    for hashed_line, line_numbers in file2_data.items():
        if hashed_line not in file1_data:
            for i in line_numbers:
                outFile.write(file2_raw[i])
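A leaner variant of the same idea, sketched here as an illustration (the function name is made up): store only the 16-byte binary digests of the first file in a set, then stream the second file line by line, so neither file is ever held in memory in full.

```python
import hashlib

def diff_by_hash(file1_path, file2_path, out_path):
    """Write lines of file2 whose MD5 digest never appears in file1."""
    seen = set()
    with open(file1_path, "rb") as f1:
        for line in f1:
            # digest() keeps 16 bytes per distinct line, regardless
            # of how long the line itself is.
            seen.add(hashlib.md5(line).digest())
    with open(file2_path, "rb") as f2, open(out_path, "wb") as out:
        for line in f2:
            if hashlib.md5(line).digest() not in seen:
                out.write(line)
```

Note that, like the answer above, this only bounds memory by the number of distinct lines in the first file; it does not preserve any ordering guarantees beyond file2's own order.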