Comparing two files in Python without any key

Asked: 2017-06-24 20:09:25

Tags: python-3.x pandas file-comparison

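I am new to Python and need to compare files from two sources.

Requirements:

I should be able to compare two files and find any differences between them. The files may be .csv, .dat, or .xlsx.

That is:
1. Records in file 1 but not in file 2
2. Records in file 2 but not in file 1
3. Records that changed between the two files

Most of the files have no key I can use for the comparison, so I want to use the entire record as the key.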
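
Using the entire record as the key can be sketched with plain tuples and sets. This is only an illustration under stated assumptions: the function name `record_diff` and the inline sample rows are hypothetical, and the input is assumed to be comma-separated text.

```python
import csv
import io

def record_diff(text1, text2):
    """Return (only_in_1, only_in_2), using each whole row as the key."""
    # Each row becomes a tuple of strings, so it can be stored in a set
    rows1 = {tuple(r) for r in csv.reader(io.StringIO(text1)) if r}
    rows2 = {tuple(r) for r in csv.reader(io.StringIO(text2)) if r}
    return rows1 - rows2, rows2 - rows1

# Hypothetical sample data standing in for the two files
file1 = "Bulbs,Carlos Soltero,515,4.43\nAvery 52,Carlos Soltero,1412,26.92\n"
file2 = "Floodlight Bulbs,Carlos Soltero,515,4.43\nAvery 52,Carlos Soltero,1412,26.92\n"

only_in_1, only_in_2 = record_diff(file1, file2)
```

Because there is no key, a changed record cannot be told apart from a delete plus an add: it simply shows up once in each result set. Sets also discard duplicate rows and the original row order.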

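Below is a basic version of my script, which I would like to have reviewed.

The script works on my sample files, which have 3,000 rows each. Experts, do you see any problems with my approach, or is there a better way?
Any improvements or suggestions are much appreciated. I am using Python 3.6 (64-bit).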

# Script to compare two CSV files
import pandas as pd

# Read the file from source 1
src1_df = pd.read_csv(r'C:\Users\Samp\compare\Source1.csv', header=None)
# Add an indicator field 'Source' as the first column of the dataframe
src1_df.insert(0, 'Source', "SRC1")
# Remove duplicate rows
uniq_src1 = src1_df.drop_duplicates(keep='first')


# Read the file from source 2
src2_df = pd.read_csv(r'C:\Users\Samp\compare\Source2.csv', header=None)
# Add an indicator field 'Source' as the first column of the dataframe
src2_df.insert(0, 'Source', "SRC2")
# Remove duplicate rows
uniq_src2 = src2_df.drop_duplicates(keep='first')


# Stack the two dataframes vertically (one on top of the other)
full_set = pd.concat([uniq_src1, uniq_src2], ignore_index=True)
# Drop every row that appears in both sources, comparing all columns except 'Source'
compare_cols = [col for col in full_set.columns if col != 'Source']
diff_df = full_set.drop_duplicates(subset=compare_cols, keep=False)

# Write the output to a CSV file
diff_df.to_csv(r'C:\Users\Samp\compare\compare_results.csv',
               index=False, encoding='utf-8')
# End of script
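
One possible alternative to the concat-and-drop approach is an outer merge with `indicator=True`, which labels each record by the side it came from. The sketch below is not the original script: the inline sample text is a hypothetical stand-in for Source1.csv and Source2.csv, and it assumes both files have the same number of columns with matching inferred types.

```python
import io
import pandas as pd

# Hypothetical inline samples standing in for Source1.csv and Source2.csv
src1_text = "Bulbs,Carlos Soltero,515,4.43\nAvery 52,Carlos Soltero,1412,26.92\n"
src2_text = "Floodlight Bulbs,Carlos Soltero,515,4.43\nAvery 52,Carlos Soltero,1412,26.92\n"

src1 = pd.read_csv(io.StringIO(src1_text), header=None).drop_duplicates()
src2 = pd.read_csv(io.StringIO(src2_text), header=None).drop_duplicates()

# With no `on` argument, merge joins on all common columns, i.e. the whole record
merged = src1.merge(src2, how="outer", indicator=True)

# The '_merge' column is 'left_only', 'right_only' or 'both'
diff = merged[merged["_merge"] != "both"]
only_in_src1 = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
only_in_src2 = merged[merged["_merge"] == "right_only"].drop(columns="_merge")
```

If the two files may be parsed with different column types, reading both with `dtype=str` is one way to compare the raw text of each field instead.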

File 1:

Refrigerator,Barry French,293,457.81
Heavy Gauge Vinyl,Barry French,293,46.71
Holmes HEPA Air Purifier,Carlos Soltero,714,30.94
Bulbs,Carlos Soltero,515,4.43
Avery 52,Carlos Soltero,1412,26.92

File 2:

Refrigerator,Barry French,293,457.81
Heavy Gauge Vinyl,Barry French,293,46.71
Holmes HEPA Air Purifier,Carlos Soltero,847,30.94
Floodlight Bulbs,Carlos Soltero,515,4.43
Accessory37,Alan Barnes,2532,-78.96

So the expected output is:

SRC1:Bulbs,Carlos Soltero,515,4.43
SRC2:Floodlight Bulbs,Carlos Soltero,515,4.43
SRC1:Holmes HEPA Air Purifier,Carlos Soltero,714,30.94
SRC2:Holmes HEPA Air Purifier,Carlos Soltero,847,30.94
SRC1:Avery 52,Carlos Soltero,1412,26.92
SRC2:Accessory37,Alan Barnes,2532,-78.96

Thank you very much for your time!

1 Answer:

Answer 0: (score: 0)

Could you iterate over the dataframes and compare them row by row?
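
A row-by-row check of this kind only makes sense when both dataframes have the same number of rows in the same order. Under that assumption, the comparison could also be sketched without an explicit loop. Here `df1` and `df2` are hypothetical stand-ins built inline, not the asker's files.

```python
import pandas as pd

# Hypothetical stand-ins for the two dataframes, already aligned row by row
df1 = pd.DataFrame([["Bulbs", "Carlos Soltero", 515, 4.43],
                    ["Avery 52", "Carlos Soltero", 1412, 26.92]])
df2 = pd.DataFrame([["Floodlight Bulbs", "Carlos Soltero", 515, 4.43],
                    ["Avery 52", "Carlos Soltero", 1412, 26.92]])

# True for each row where at least one column differs
changed = df1.ne(df2).any(axis=1)

# The differing rows from each side, selected by the same boolean mask
rows_src1 = df1[changed]
rows_src2 = df2[changed]
```

Note that a missing value (NaN) compares as unequal to another NaN, so rows containing missing values are flagged as changed in this sketch.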

In [48]: values = []
    ...: # Compare the two dataframes row by row, by position
    ...: # (.iloc replaces .ix, which has been removed from pandas)
    ...: for x in range(len(src_1df.index)):
    ...:     if src_1df.iloc[x].tolist() == src_2df.iloc[x].tolist():
    ...:         continue
    ...:     values.append('SRC1:' + str(src_1df.iloc[x].values))
    ...:     values.append('SRC2:' + str(src_2df.iloc[x].values))
    ...: values
    ...:
Out[48]:
["SRC1:['Holmes HEPA Air Purifier' 'Carlos Soltero' 714 30.940000000000001]",
 "SRC2:['Holmes HEPA Air Purifier' 'Carlos Soltero' 847 30.940000000000001]",
 "SRC1:['Bulbs' 'Carlos Soltero' 515 4.4299999999999997]",
 "SRC2:['Floodlight Bulbs' 'Carlos Soltero' 515 4.4299999999999997]",
 "SRC1:['Avery 52' 'Carlos Soltero' 1412 26.920000000000002]",
 "SRC2:['Accessory37' 'Alan Barnes' 2532 -78.959999999999994]"]