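I am new to Python and need to compare files from two sources.
Requirements:
I should be able to compare two files and find any differences between them. The files may be .csv, .dat, or .xlsx.
i.e. 1. records in file 1 but not in file 2; 2. records in file 2 but not in file 1; 3. records that changed between the two files.
Most of the files have no key I can use for the comparison, so I want to use the entire record as the key.
Below is a basic version of my script, which I would like to have reviewed.
The script works for my sample files, which have 3000 rows each.
Experts, do you see any problem with my approach? Or do you have a better one?
Any improvements/suggestions are much appreciated.
I am using Python 3.6 (64-bit).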
# Script to compare two CSV files
import pandas as pd

# Read the file from source 1
src1_df = pd.read_csv(r'C:\Users\Samp\compare\Source1.csv', header=None)
# Add an indicator field 'Source' as the first field in the dataframe
src1_df.insert(0, 'Source', 'SRC1')
# Remove duplicates
uniq_src1 = src1_df.drop_duplicates(keep='first')

# Read the file from source 2
src2_df = pd.read_csv(r'C:\Users\Samp\compare\Source2.csv', header=None)
# Add an indicator field 'Source' as the first field in the dataframe
src2_df.insert(0, 'Source', 'SRC2')
# Remove duplicates
uniq_src2 = src2_df.drop_duplicates(keep='first')

# Stack the two dataframes vertically (one below the other)
full_set = pd.concat([uniq_src1, uniq_src2], ignore_index=True)
# Drop every row that occurs in both sources, comparing all fields except 'Source'
diff_df = full_set.drop_duplicates(
    subset=full_set.columns.difference(['Source']), keep=False)
# Write the output to a CSV file
diff_df.to_csv(r'C:\Users\Samp\compare\compare_results.csv',
               index=False, encoding='utf-8')
# End of script
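Because the inputs may be .csv, .dat, or .xlsx, the loading step could be wrapped in a small helper. The following is only a sketch: `load_records` and its `dat_sep` parameter are hypothetical names, it assumes the .dat files are delimited text without a header row, and reading .xlsx requires an Excel engine such as openpyxl to be installed. Everything is read as strings so that the whole-record comparison is less likely to be thrown off by type inference (for example 293 in one file and 293.0 in the other).

```python
import os

import pandas as pd


def load_records(path, dat_sep=","):
    """Load a headerless .csv, .dat, or .xlsx file into a DataFrame of strings.

    Hypothetical helper for illustration; not part of the original script.
    """
    ext = os.path.splitext(path)[1].lower()
    if ext == ".csv":
        return pd.read_csv(path, header=None, dtype=str)
    if ext == ".dat":
        # Assumption: .dat files are delimited text; adjust dat_sep as needed.
        return pd.read_csv(path, header=None, dtype=str, sep=dat_sep)
    if ext == ".xlsx":
        # Requires an Excel engine such as openpyxl to be installed.
        return pd.read_excel(path, header=None, dtype=str)
    raise ValueError("Unsupported file type: " + ext)
```

With such a helper, the two `pd.read_csv` calls in the script could be swapped for `load_records(...)` and the rest left unchanged.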
File 1:
Refrigerator,Barry French,293,457.81
Heavy Gauge Vinyl,Barry French,293,46.71
Holmes HEPA Air Purifier,Carlos Soltero,714,30.94
Bulbs,Carlos Soltero,515,4.43
Avery 52,Carlos Soltero,1412,26.92
File 2:
Refrigerator,Barry French,293,457.81
Heavy Gauge Vinyl,Barry French,293,46.71
Holmes HEPA Air Purifier,Carlos Soltero,847,30.94
Floodlight Bulbs,Carlos Soltero,515,4.43
Accessory37,Alan Barnes,2532,-78.96
So the expected output is:
SRC1:Bulbs,Carlos Soltero,515,4.43
SRC2:Floodlight Bulbs,Carlos Soltero,515,4.43
SRC1:Holmes HEPA Air Purifier,Carlos Soltero,714,30.94
SRC2:Holmes HEPA Air Purifier,Carlos Soltero,847,30.94
SRC1:Avery 52,Carlos Soltero,1412,26.92
SRC2:Accessory37,Alan Barnes,2532,-78.96
Thank you very much for your time!
Answer 0 (score: 0)
Could you iterate over the dataframes and check them row by row?
In [48]: values = []
    ...: for x in range(len(src_1df.index)):
    ...:     # Skip rows that are identical at the same position in both frames
    ...:     if src_1df.iloc[x].tolist() == src_2df.iloc[x].tolist():
    ...:         continue
    ...:     values.append('SRC1:' + str(src_1df.iloc[x].values))
    ...:     values.append('SRC2:' + str(src_2df.iloc[x].values))
    ...: values
    ...:
Out[48]:
["SRC1:['Holmes HEPA Air Purifier' 'Carlos Soltero' 714 30.940000000000001]",
"SRC2:['Holmes HEPA Air Purifier' 'Carlos Soltero' 847 30.940000000000001]",
"SRC1:['Bulbs' 'Carlos Soltero' 515 4.4299999999999997]",
"SRC2:['Floodlight Bulbs' 'Carlos Soltero' 515 4.4299999999999997]",
"SRC1:['Avery 52' 'Carlos Soltero' 1412 26.920000000000002]",
"SRC2:['Accessory37' 'Alan Barnes' 2532 -78.959999999999994]"]