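Given 2 Excel files, each with about 200 columns and a common index column (i.e. every row in both files has, say, a name attribute), what is the best way to generate an output Excel file that contains only the differences going from Excel file 1 to Excel file 2? A difference is defined as any row that is new in file 2 (not present in file 1), or a row in file 2 that has the same index (name) as a row in file 1 but differs in one or more of the other columns. There is a good pandas-based example that may be useful: Compare 2 Excel files and output an Excel file with differences. However, it is hard to apply that solution to Excel files with 200 columns.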
Below are examples of the 2 Excel files in CSV format, simplified from 200 columns down to 4. The index column is Name.
File 1:
Name,value,location,Name Copy
Bob,400,Sydney,Bob
Tim,500,Perth,Tim
File 2:
Name,value,location,Name Copy
Bob,400,Sydney,Bob
Tim,500,Adelaide,Tim
Melanie,600,Brisbane,Melanie
So, given the 2 input files above, the output file should be:
Name,value,location,Name Copy
Tim,500,Adelaide,Tim
Melanie,600,Brisbane,Melanie
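So the output file has 2 rows (excluding the column header row): row 1 contains the fields that changed from file1 to file2, and row 2 is a new row that does not exist in file1.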
The following approach works, but the index column is lost (it is [1, 2] rather than ['Tim', 'Melanie']):
import pandas as pd
df1 = pd.read_excel('simple1.xlsx', index_col=0)
df2 = pd.read_excel('simple2.xlsx', index_col=0)
df3 = pd.merge(df1, df2, how='right', sort=False, indicator='Indicator')
df4 = df3.loc[df3['Indicator'] == 'right_only']
df5 = df4.drop('Indicator', axis=1)
with pd.ExcelWriter('test.xlsx', engine='xlsxwriter') as writer:
    df5.to_excel(writer, sheet_name='Sheet1')
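A likely reason the index is lost: when pd.merge joins on columns, it ignores the indexes of both dataframes and returns a fresh integer index. A minimal sketch of one way around this, assuming the index column is called Name and that both files have the same columns, is to turn the index into an ordinary column before merging and restore it afterwards. The function name diff_rows and the in-memory sample frames (standing in for the two Excel files) are hypothetical, used only for illustration.

```python
import pandas as pd

def diff_rows(df_old, df_new, key="Name"):
    """Return the rows of df_new that are new or changed relative to df_old."""
    # Turn the index into an ordinary column so the merge does not discard it
    left = df_old.reset_index()
    right = df_new.reset_index()
    # Merge on all shared columns; rows of df_new without an exact match in
    # df_old are marked 'right_only'
    merged = pd.merge(left, right, how="right", indicator="Indicator")
    diff = merged.loc[merged["Indicator"] == "right_only"]
    # Drop the helper column and restore the original index
    return diff.drop(columns="Indicator").set_index(key)

# Sample data from the question, built in memory instead of read from Excel
df1 = pd.DataFrame({
    "Name": ["Bob", "Tim"],
    "value": [400, 500],
    "location": ["Sydney", "Perth"],
    "Name Copy": ["Bob", "Tim"],
}).set_index("Name")
df2 = pd.DataFrame({
    "Name": ["Bob", "Tim", "Melanie"],
    "value": [400, 500, 600],
    "location": ["Sydney", "Adelaide", "Brisbane"],
    "Name Copy": ["Bob", "Tim", "Melanie"],
}).set_index("Name")

result = diff_rows(df1, df2)
print(result)
```

Because the merge uses every shared column as a key, this scales to 200 columns without listing them by hand.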
Answer 0 (score: 0)
The solution is to use numpy.array_equal to determine whether rows are equal:
import sys
import pandas as pd
import numpy as np
# Check for correct number of input arguments
if len(sys.argv) != 4:
    print('Usage :\n\tpython {} old_excel_file new_excel_file output_excel_file\n'.format(sys.argv[0]))
    sys.exit(1)
# Import input files into dataframes
old_file = sys.argv[1]
new_file = sys.argv[2]
out_file = sys.argv[3]
df1 = pd.read_excel(old_file, index_col=0)
df2 = pd.read_excel(new_file, index_col=0)
# Merge dataframes, maintaining index
df_merged = pd.merge(df1, df2, left_index=True, right_index=True, how='outer', sort=False, indicator='Indicator')
# Add right-only rows to output dataframe
right_only_index = df_merged.index[df_merged['Indicator'] == 'right_only']
df_out = df2.loc[right_only_index]
# Find the "both" rows that are not equal in the two files, and append them to the output dataframe
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
both_index = df_merged.index[df_merged['Indicator'] == 'both']
changed_index = [i for i in both_index if not np.array_equal(df1.loc[i].values, df2.loc[i].values)]
df_out = pd.concat([df_out, df2.loc[changed_index]])
# Write output dataframe to an Excel file (header row first, then the data rows)
with pd.ExcelWriter(out_file, engine='xlsxwriter') as writer:
    df_out.to_excel(writer, sheet_name='Sheet1')
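One caveat with comparing rows value by value: NaN never compares equal to NaN, so a row with an empty cell in both files would be reported as changed. Below is a hedged sketch of a vectorized alternative that treats two missing cells as equal and keeps the row order of file 2. It assumes both dataframes have the same column names and a unique index. The function name changed_or_new and the sample frames are hypothetical, not part of the original answer.

```python
import pandas as pd

def changed_or_new(df_old, df_new):
    """Return rows of df_new that are absent from df_old or differ from it."""
    # Rows whose index label does not occur in the old file at all
    is_new = ~df_new.index.isin(df_old.index)
    common = df_new.index[~is_new]
    # Align both frames on the shared labels and on df_new's column order
    old_c = df_old.loc[common, df_new.columns]
    new_c = df_new.loc[common]
    # A cell is unchanged if the values match or if both cells are missing
    same = (old_c == new_c) | (old_c.isna() & new_c.isna())
    changed = common[~same.all(axis=1).to_numpy()]
    # Boolean mask keeps the original row order of df_new
    keep = is_new | df_new.index.isin(changed)
    return df_new.loc[keep]

# Sample data from the question, built in memory instead of read from Excel
df1 = pd.DataFrame(
    {"value": [400, 500], "location": ["Sydney", "Perth"], "Name Copy": ["Bob", "Tim"]},
    index=pd.Index(["Bob", "Tim"], name="Name"),
)
df2 = pd.DataFrame(
    {"value": [400, 500, 600],
     "location": ["Sydney", "Adelaide", "Brisbane"],
     "Name Copy": ["Bob", "Tim", "Melanie"]},
    index=pd.Index(["Bob", "Tim", "Melanie"], name="Name"),
)

df_out = changed_or_new(df1, df2)
print(df_out)
```

The result can be written with df_out.to_excel in the same way as in the script above.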