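Can anyone suggest how to improve my code? I have 4 large CSV files. The first is a reference file that is compared against the other 3 (file1, file2 and file3). Each file has three columns, and each row is one unit (for example, ABC, DEF and GHI are 3 separate units).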
col_1 col_2 col_3
A B C
D E F
G H I
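I want to compare file1, file2 and file3 against the reference file. If the unit in a reference row appears in all 3 files, I want to write it to file A. If it appears in at least 1 of the 3 files (but not all of them), it should be written to file B. If it appears in none of the 3 files, I want to write it to file C. My current strategy is to load the 4 files into 4 separate lists and compare them. I realize this approach is memory-intensive. Also, my script has been running for a long time without producing any final output. So I would like to know whether there is a more efficient way to solve this problem.
Here is my code: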
import csv
reference_1 = open ('reference.csv', 'rt', newline = '')
reader = csv.reader(reference_1, delimiter = ',')
file1 = open ('file1.csv','rt', newline = '')
reader1 = csv.reader(file1, delimiter = ',')
file2 = open ('file2.csv', 'rt',newline = '')
reader2 = csv.reader(file2, delimiter = ',')
file3 = open ('file3.csv', 'rt',newline = '')
reader3 = csv.reader(file3, delimiter = ',')
Common = open ('Common.csv', 'w',newline = '')
writer1 = csv.writer(Common, delimiter = ',')
Partial = open ('Partial.csv', 'w',newline = '')
writer2 = csv.writer(Partial, delimiter = ',')
Absent = open ('Absent.csv', 'w',newline = '')
writer3 = csv.writer(Absent, delimiter = ',')
reference = []
fileA = []
fileB = []
fileC = []
for row in reader:
    reference.append (row)
for row in reader1:
    fileA.append(row)
for row in reader2:
    fileB.append(row)
for row in reader3:
    fileC.append(row)
for row in reference:
    if row in fileA and row in fileB and row in fileC:
        writer1.writerow (row)
        continue
    elif row in fileA or row in fileB or row in fileC:
        writer2.writerow (row)
        continue
    else:
        writer3.writerow (row)
reference_1.close()
file1.close()
file2.close()
file3.close()
Common.close()
Partial.close()
Absent.close()
Answer 0 (score: 1)
Assuming the order of the rows does not matter and the reference file contains no duplicate rows, here is an option that uses set.
def file_to_set(filename):
    """Opens a file and returns a set containing each line of the file."""
    with open(filename) as f:
        return set(f.read().splitlines(True))

def set_to_file(s, filename):
    """Writes a set to file."""
    with open(filename, 'w') as f:
        f.writelines(s)

def compare_files(ref_filename, *files):
    """Compares a reference file to two or more files."""
    if len(files) < 2:
        raise TypeError("compare_files expected at least 2 files, got %s" %
                        len(files))
    ref = file_to_set(ref_filename)
    file_data = [file_to_set(f) for f in files]
    all = file_data[0].union(*file_data[1:])
    common = ref.intersection(*file_data)
    partial = ref.intersection(all).difference(common)
    absent = ref.difference(all)
    set_to_file(common, 'common.csv')
    set_to_file(partial, 'partial.csv')
    set_to_file(absent, 'absent.csv')

compare_files('reference.csv', 'file1.csv', 'file2.csv', 'file3.csv')
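One caveat with comparing raw lines: splitlines(True) keeps the line terminator, so a last line that lacks a trailing newline will not match the same row in another file. If that matters, or if the order of the reference rows should be preserved, a variant along the following lines may help. This is only a sketch: rows_of and classify are hypothetical names, rows are parsed with the csv module into tuples, and only the three comparison files are held in memory as sets while the reference is read row by row.

```python
import csv


def rows_of(filename):
    """Yield each CSV row of `filename` as a tuple of strings."""
    with open(filename, newline='') as f:
        for row in csv.reader(f):
            yield tuple(row)


def classify(ref_filename, *files):
    """Return (common, partial, absent) lists of reference rows, in file order."""
    # One set per comparison file, so each membership test is a hash lookup.
    others = [set(rows_of(name)) for name in files]
    common, partial, absent = [], [], []
    for row in rows_of(ref_filename):
        # Count how many comparison files contain this reference row.
        hits = sum(row in rows for rows in others)
        if hits == len(others):
            common.append(row)
        elif hits:
            partial.append(row)
        else:
            absent.append(row)
    return common, partial, absent
```

The three returned lists can then be written out with csv.writer(...).writerows(...). Because the reference is streamed rather than turned into a set, duplicate reference rows are kept as they appear.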
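The idea is:
1. Build a set of every line that appears in any of the files being compared (all).
2. Build a set (common) that contains only the lines present in every file, including the reference file.
3. Build a set (partial) that contains the lines of the reference file that also appear in at least one, but not all, of the other files.
4. Build a set (absent) that contains the lines present only in the reference file.
5. Write common, partial and absent to files.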
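To make the set operations concrete, here is a small in-memory sketch using the sample units from the question. The contents of the three comparison sets are made up for illustration, and seen_anywhere is a stand-in name for the union that the answer calls all.

```python
# Reference rows and three comparison "files", each represented as a set of lines.
ref = {"A,B,C", "D,E,F", "G,H,I"}
file_data = [
    {"A,B,C", "D,E,F"},
    {"A,B,C", "X,Y,Z"},
    {"A,B,C"},
]

# Union of every line seen in any comparison file.
seen_anywhere = set().union(*file_data)

# Lines present in the reference and in every comparison file.
common = ref.intersection(*file_data)

# Lines of the reference found in at least one, but not all, comparison files.
partial = ref.intersection(seen_anywhere).difference(common)

# Lines of the reference found in none of the comparison files.
absent = ref.difference(seen_anywhere)

print(sorted(common), sorted(partial), sorted(absent))
```

Each membership check against a set is a hash lookup, whereas the row in fileA tests in the question scan a list from the start every time, which is likely why the original script runs so long on large files.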