I am trying to compute the difference between two large CSV files (~4 GB) to get the newly added rows and write them to an output CSV file. I can do this for relatively small files (~50 MB) with the following code.
input_file1 = "data.csv"
input_file2 = "data_1.csv"
output_path = "out.csv"

with open(input_file1, 'r') as t1, open(input_file2, 'r') as t2:
    fileone = t1.readlines()
    filetwo = t2.readlines()

with open(output_path, 'w') as outFile:
    for line in filetwo:
        if line not in fileone:
            outFile.write(line)
However, for the larger files the code above is either far too slow (it runs for about half an hour) or crashes for lack of memory.

Is there a faster way to get the difference between two large CSV files?
Answer 0 (score: 3)
For starters, note the typo here:

with open(input_file2, 'r') as t1, open(input_file2, 'r') as t2:

You are reading the same file (input_file2) twice.

Next, you don't have to read the second file into memory at all; just iterate over it line by line.

For speed, simply build a set from the first file (fast lookups, and it saves memory if there are duplicate lines). For that, you have to keep the second file open while writing the result:
input_file1 = "data.csv"
input_file2 = "data_1.csv"
output_path = "out.csv"

with open(input_file1, 'r') as t1:
    fileone = set(t1)

with open(input_file2, 'r') as t2, open(output_path, 'w') as outFile:
    for line in t2:
        if line not in fileone:
            outFile.write(line)
for line in t2 reads the file line by line (always avoid readlines() when you can), so the memory footprint stays small even for a big file. fileone does need some memory, yes, but hopefully less if the file is smaller and/or has duplicate lines, and certainly no more than readlines() would use. if line not in fileone probably looks the same as before, but it now has average O(1) complexity, which makes the program much faster.

Answer 1 (score: 2)
You could use a database, or a sort-merge. I'll give you the basic algorithm (rather than Python code).

The idea is to sort the two files into the same order, then read through both files in step:
Sort the 2 files to new SortedFiles using the Operating System's sort
(use the whole record as the sort key)

Open/Read SortedOldFile
Open/Read SortedNewFile

while (not end-of-file-SortedOldFile) and (not end-of-file-SortedNewFile):
    if SortedOldFile.record < SortedNewFile.record:
        ## Deleted processing goes here
        read SortedOldFile
    elseif SortedOldFile.record > SortedNewFile.record:
        ## Insert processing goes here
        read SortedNewFile
    else:
        read SortedOldFile
        read SortedNewFile

while (not end-of-file-SortedOldFile):
    ## Deleted processing
    read SortedOldFile

while (not end-of-file-SortedNewFile):
    ## Insert processing
    read SortedNewFile
Advantages:

Disadvantages:
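The pseudocode above can be sketched in Python. This is a minimal illustration, not from the original answer: it assumes both files have already been sorted externally (e.g. with the Unix sort command) and handles only the "insert" branch the question asks about, holding just one line per file in memory at a time. The function name and file paths are made up for the example.

```python
def diff_sorted(old_path, new_path, out_path):
    """Write lines that appear in new_path but not in old_path.

    Both input files must already be sorted line-by-line.
    """
    with open(old_path) as old_f, open(new_path) as new_f, \
         open(out_path, "w") as out:
        old_line = old_f.readline()
        new_line = new_f.readline()
        while old_line and new_line:
            if old_line < new_line:
                # Only in the old file: a deleted line, skip it.
                old_line = old_f.readline()
            elif old_line > new_line:
                # Only in the new file: an inserted line, keep it.
                out.write(new_line)
                new_line = new_f.readline()
            else:
                # Present in both files: advance both.
                old_line = old_f.readline()
                new_line = new_f.readline()
        while new_line:
            # Anything left in the new file is an insert.
            out.write(new_line)
            new_line = new_f.readline()
```

Because only two lines are ever held in memory, this works for files of any size; the cost is the external sort step up front.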
Answer 2 (score: 0)
You can hash the lines to compress them into a smaller set for comparison, or use a more advanced algorithm to compute fingerprints:

https://en.wikipedia.org/wiki/Fingerprint_(computing)
import hashlib

input_file1 = "data.csv"
input_file2 = "data_1.csv"
output_path = "out.csv"

def get_data(file_):
    # Map each line's MD5 digest to the line numbers where it occurs.
    res = {}
    for i, line in enumerate(file_):
        # Use a fresh hash object per line: reusing one via update()
        # would hash the concatenation of all lines read so far.
        hashed_line = hashlib.md5(line.encode()).hexdigest()
        res.setdefault(hashed_line, []).append(i)
    return res

with open(input_file1, 'r') as t1, open(input_file2, 'r') as t2:
    file1_data = get_data(t1)
    # Read file2 into a list first, then hash the list; iterating
    # the file twice would find it already exhausted.
    file2_raw = t2.readlines()
    file2_data = get_data(file2_raw)

with open(output_path, 'w') as outFile:
    for hashed_line, line_numbers in file2_data.items():
        if hashed_line not in file1_data:
            for i in line_numbers:
                outFile.write(file2_raw[i])
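A leaner variant of the same idea, sketched here as an illustration (the function name is made up): store only the 16-byte binary digests of the first file in a set, then stream the second file line by line, so neither file is ever held in memory in full.

```python
import hashlib

def diff_by_hash(file1_path, file2_path, out_path):
    """Write lines of file2 whose MD5 digest never appears in file1."""
    seen = set()
    with open(file1_path, "rb") as f1:
        for line in f1:
            # digest() keeps 16 bytes per distinct line, regardless
            # of how long the line itself is.
            seen.add(hashlib.md5(line).digest())
    with open(file2_path, "rb") as f2, open(out_path, "wb") as out:
        for line in f2:
            if hashlib.md5(line).digest() not in seen:
                out.write(line)
```

Note that, like the answer above, this only bounds memory by the number of distinct lines in the first file; it does not preserve any ordering guarantees beyond file2's own order.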