I have two CSV files. Each has 6 columns, and both share a common column that serves as the primary key for comparison, for example EmpID.
File1.csv is:
EmpID1,Name1,Email1,City1,Phone1,Hobby1
120034,Tom Hanks,tom.hanks@gmail.com,Mumbai,8888999,Fishing
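File2.csv is:
EmpID2,Name2,Email2,City2,Phone2,Hobby2
120034,Tom Hanks,hanks.tom@gmail.com,Mumbai,8888999,Running
I need to compare the two files and write only the rows and columns that differ to a new output file, like this:
EmpID1,Email1,Email2,Hobby1,Hobby2
120034,tom.hanks@gmail.com,hanks.tom@gmail.com,Fishing,Running
I have started writing this in Python, and now I would like to know how to identify and select the differences. Any pointers or help would be greatly appreciated.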
Answer 0 (score: 1)
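First, read each file into a dict keyed by the employee ID, so that each ID maps to its whole row:
import csv

fieldnames = []  # collects the column names of both files

# newline='' is what the csv module recommends for files it reads or writes
with open('File1.csv', newline='') as f:
    cf = csv.DictReader(f, delimiter=',')
    data1 = {row['EmpID1']: row for row in cf}  # key must match the header's case
    fieldnames.extend(cf.fieldnames)

with open('File2.csv', newline='') as f:
    cf = csv.DictReader(f, delimiter=',')
    data2 = {row['EmpID2']: row for row in cf}
    fieldnames.extend(cf.fieldnames)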
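To see the shape of the resulting structure, here is a small self-contained sketch that reads the sample File1.csv row from an in-memory string instead of a file on disk. The `io.StringIO` stand-in and the names `sample` and `rows_by_id` are assumptions made for illustration only:

```python
import csv
import io

# In-memory stand-in for File1.csv, using the sample row from the question.
sample = io.StringIO(
    "EmpID1,Name1,Email1,City1,Phone1,Hobby1\n"
    "120034,Tom Hanks,tom.hanks@gmail.com,Mumbai,8888999,Fishing\n"
)

reader = csv.DictReader(sample)
# Key each row by its employee ID so rows can be looked up directly.
rows_by_id = {row["EmpID1"]: row for row in reader}

print(reader.fieldnames)
print(rows_by_id["120034"]["Email1"])
```

Each value is a full row dict, so any column of a given employee can be reached with two lookups.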
Then identify the IDs that appear in both dicts:
ids_to_check = set(data1) & set(data2)
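The intersection keeps only the IDs that appear in both files. As an aside, the same set operations can also report IDs found in just one file, which may be worth logging. The dictionaries below are made-up stand-ins for `data1` and `data2`, and the extra IDs are hypothetical:

```python
# Hypothetical stand-ins for the dicts built from the two CSV files.
data1 = {"120034": {}, "120035": {}}
data2 = {"120034": {}, "120036": {}}

in_both = set(data1) & set(data2)    # IDs to compare row by row
only_in_1 = set(data1) - set(data2)  # IDs missing from File2.csv
only_in_2 = set(data2) - set(data1)  # IDs missing from File1.csv

print(in_both, only_in_1, only_in_2)
```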
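Finally, iterate over the IDs and compare the rows themselves:
with open('OutputFile.csv', 'w', newline='') as f:
    cw = csv.DictWriter(f, fieldnames, delimiter=',')
    cw.writeheader()
    for emp_id in sorted(ids_to_check):  # sorted for a stable output order
        diff = compare_dict(data1[emp_id], data2[emp_id], fieldnames)
        if diff:  # an empty dict means the rows are identical
            cw.writerow(diff)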
Here is the implementation of the compare_dict function:
def compare_dict(d1, d2, fields_compare):
    # Strip the trailing file suffix ('1' or '2') to get the shared base names
    fields_compare = set(field.rstrip('12') for field in fields_compare)
    if any(d1[k + '1'] != d2[k + '2'] for k in fields_compare):
        # they differ, return a new dict with all fields
        result = d1.copy()
        result.update(d2)
        return result
    else:
        return {}
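Note that compare_dict returns every column of both rows as soon as any field differs, whereas the expected output in the question lists only the columns that changed. The following is a minimal sketch of one way to narrow the result. The helper name `diff_columns` and the assumption that paired columns share a base name with a `1` or `2` suffix are mine, not part of the answer above:

```python
def diff_columns(row1, row2, key="EmpID"):
    """Return the key plus only the column pairs whose values differ."""
    # Assumes paired columns are named <base>1 in row1 and <base>2 in row2.
    result = {key + "1": row1[key + "1"]}
    for name, value1 in row1.items():
        base = name[:-1]
        if base == key:
            continue  # the key column is always kept, never compared
        value2 = row2.get(base + "2")
        if value1 != value2:
            result[base + "1"] = value1
            result[base + "2"] = value2
    # Only the key left means the rows are identical.
    return result if len(result) > 1 else {}


row1 = {"EmpID1": "120034", "Name1": "Tom Hanks",
        "Email1": "tom.hanks@gmail.com", "City1": "Mumbai",
        "Phone1": "8888999", "Hobby1": "Fishing"}
row2 = {"EmpID2": "120034", "Name2": "Tom Hanks",
        "Email2": "hanks.tom@gmail.com", "City2": "Mumbai",
        "Phone2": "8888999", "Hobby2": "Running"}

print(diff_columns(row1, row2))
```

Because different rows may differ in different columns, the header passed to csv.DictWriter would need to be the union of the keys across all differing rows. Columns missing from a given row are then written as empty strings, which is DictWriter's default behavior.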