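I'll do my best to keep this focused (with simplified data). I have a data table with four columns (note that more columns may be added later). None of them is unique on its own, but the three columns 'ID', 'ID2', and 'DO' must be unique as a group. I load that table into one dataframe and bring an updated version of the table into another dataframe.
If df is the "original data" and df2 is the "updated data", is this the most accurate/efficient way to find what has changed relative to the original data?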
import pandas as pd
#Sample Data:
df = pd.DataFrame({'ID':[546,107,478,546,478], 'ID2':['AUSER','BUSER','CUSER','AUSER','EUSER'], 'DO':[3,6,8,4,6], 'DATA':['ORIG','ORIG','ORIG','ORIG','ORIG']})
df2 = pd.DataFrame({'ID':[107,546,123,546,123], 'ID2':['BUSER','AUSER','DUSER','AUSER','FUSER'], 'DO':[6,3,2,4,3], 'DATA':['CHANGE','CHANGE','CHANGE','ORIG','CHANGE']})
>>> df
DATA DO ID ID2
0 ORIG 3 546 AUSER
1 ORIG 6 107 BUSER
2 ORIG 8 478 CUSER
3 ORIG 4 546 AUSER
4 ORIG 6 478 EUSER
>>> df2
DATA DO ID ID2
0 CHANGE 6 107 BUSER
1 CHANGE 3 546 AUSER
2 CHANGE 2 123 DUSER
3 ORIG 4 546 AUSER
4 CHANGE 3 123 FUSER
#Compare Dataframes
merged = df2.merge(df, indicator=True, how='outer')
#Split the merged comparison into:
# - original records that will be updated or deleted
# - new records that will be inserted or update the original record.
df_original = merged.loc[merged['_merge'] == 'right_only'].drop(columns=['_merge']).copy()
df_new = merged.loc[merged['_merge'] == 'left_only'].drop(columns=['_merge']).copy()
#Create another merge to determine if the new records will either be updates or inserts
check = pd.merge(df_new,df_original, how='left', left_on=['ID','ID2','DO'], right_on = ['ID','ID2','DO'], indicator=True)
in_temp = check[['ID','ID2','DO']].loc[check['_merge']=='left_only']
upd_temp = check[['ID','ID2','DO']].loc[check['_merge']=='both']
#Create dataframes for each Transaction:
# - removals: Remove records based on provided key values
# - updates: Update entire record based on key values
# - inserts: Insert entire record
removals = pd.concat([df_original[['ID','ID2','DO']],df_new[['ID','ID2','DO']],df_new[['ID','ID2','DO']]]).drop_duplicates(keep=False)
updates = df2.loc[(df2['ID'].isin(upd_temp['ID']))&(df2['ID2'].isin(upd_temp['ID2']))&(df2['DO'].isin(upd_temp['DO']))].copy()
inserts = df2.loc[(df2['ID'].isin(in_temp['ID']))&(df2['ID2'].isin(in_temp['ID2']))&(df2['DO'].isin(in_temp['DO']))].copy()
Results:
>>> removals
ID ID2 DO
6 478 CUSER 8
8 478 EUSER 6
>>> updates
DATA DO ID ID2
0 CHANGE 6 107 BUSER
1 CHANGE 3 546 AUSER
>>> inserts
DATA DO ID ID2
2 CHANGE 2 123 DUSER
4 CHANGE 3 123 FUSER
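To restate the question: will this logic consistently and correctly identify the differences between two dataframes, given the specified key columns? Is there a more efficient or more pythonic way to do it?
Updated the sample data to include more records, with the corresponding results.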
Answer 0 (score: 0)
Each kind of difference is computed separately, starting from this sample data:
import pandas as pd
#Sample Data:
df = pd.DataFrame({'ID':[546,107,478,546], 'ID2':['AUSER','BUSER','CUSER','AUSER'], 'DO':[3,6,8,4], 'DATA':['ORIG','ORIG','ORIG','ORIG']})
df2 = pd.DataFrame({'ID':[107,546,123,546], 'ID2':['BUSER','AUSER','DUSER','AUSER'], 'DO':[6,3,2,4], 'DATA':['CHANGE','CHANGE','CHANGE','ORIG']})
Changes:
#Concat df and df2, then drop every row that appears in both (those rows are unchanged)
df3 = pd.concat([df, df2]).drop_duplicates(keep=False)
#A key that still appears twice has different DATA in df and df2, so it was changed.
#Change
df3 = df3.groupby(['ID', 'ID2', 'DO'])['DATA']\
.size()\
.reset_index(name='n')\
.query('n == 2')\
.drop(columns='n')\
.assign(DATA='CHANGE')
ID ID2 DO DATA
0 107 BUSER 6 CHANGE
3 546 AUSER 3 CHANGE
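If more non-key columns are added later, as the question anticipates, one possible variation is to join the two frames on the key and compare the remaining columns directly. This is only a minimal sketch, assuming both frames have the same columns and that the key is unique within each frame. The names `keys`, `both`, `value_cols`, and `changed` are illustrative and not part of the original code.

```python
import pandas as pd

keys = ['ID', 'ID2', 'DO']  # composite key named in the question

df = pd.DataFrame({'ID': [546, 107, 478, 546],
                   'ID2': ['AUSER', 'BUSER', 'CUSER', 'AUSER'],
                   'DO': [3, 6, 8, 4],
                   'DATA': ['ORIG', 'ORIG', 'ORIG', 'ORIG']})
df2 = pd.DataFrame({'ID': [107, 546, 123, 546],
                    'ID2': ['BUSER', 'AUSER', 'DUSER', 'AUSER'],
                    'DO': [6, 3, 2, 4],
                    'DATA': ['CHANGE', 'CHANGE', 'CHANGE', 'ORIG']})

# Pair up rows that share a key; non-key columns get _old/_new suffixes.
both = df.merge(df2, on=keys, how='inner', suffixes=('_old', '_new'))

# Compare every non-key column, so extra columns need no code change.
value_cols = [c for c in df.columns if c not in keys]
changed = pd.Series(False, index=both.index)
for c in value_cols:
    # Caveat: NaN != NaN is True, so missing values count as changes here.
    changed |= both[c + '_old'] != both[c + '_new']

# Keep the updated version of each changed record.
changes = df2.merge(both.loc[changed, keys], on=keys, how='inner')
print(changes)
```

With the four-row sample this should flag the BUSER record and the AUSER record with DO 3, and leave the unchanged AUSER record with DO 4 alone.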
For inserts:
#Compare the key columns of df2 against df to see what is new in df2
#Note: each column is tested independently, so this is not a strict composite-key match
#Inserts
df2[~df2['ID'].isin(df['ID']) &
~df2['ID2'].isin(df['ID2']) &
~df2['DO'].isin(df['DO'])]
ID ID2 DO DATA
2 123 DUSER 2 CHANGE
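One caveat with testing each key column separately: a new record is missed whenever any one of its key values happens to occur somewhere in the other frame. A hedged alternative is a left anti-join on the full key using `merge(..., indicator=True)`. The sketch below uses the five-row data from the question and assumes the key is unique within each frame. `keys`, `probe`, and `per_column` are illustrative names.

```python
import pandas as pd

keys = ['ID', 'ID2', 'DO']  # composite key named in the question

df = pd.DataFrame({'ID': [546, 107, 478, 546, 478],
                   'ID2': ['AUSER', 'BUSER', 'CUSER', 'AUSER', 'EUSER'],
                   'DO': [3, 6, 8, 4, 6],
                   'DATA': ['ORIG', 'ORIG', 'ORIG', 'ORIG', 'ORIG']})
df2 = pd.DataFrame({'ID': [107, 546, 123, 546, 123],
                    'ID2': ['BUSER', 'AUSER', 'DUSER', 'AUSER', 'FUSER'],
                    'DO': [6, 3, 2, 4, 3],
                    'DATA': ['CHANGE', 'CHANGE', 'CHANGE', 'ORIG', 'CHANGE']})

# Per-column test: FUSER is skipped because DO == 3 also occurs in df.
per_column = df2[~df2['ID'].isin(df['ID']) &
                 ~df2['ID2'].isin(df['ID2']) &
                 ~df2['DO'].isin(df['DO'])]

# Anti-join: df2 rows whose full key has no match in df are inserts.
probe = df2.merge(df[keys], on=keys, how='left', indicator=True)
inserts = probe.loc[probe['_merge'] == 'left_only'].drop(columns='_merge')

# Same idea with the frames swapped: df rows missing from df2 are removals.
probe = df.merge(df2[keys], on=keys, how='left', indicator=True)
removals = probe.loc[probe['_merge'] == 'left_only'].drop(columns='_merge')

print(per_column)
print(inserts)
print(removals)
```

Matching on the whole key at once keeps the check row-wise, which is what the uniqueness constraint in the question implies.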
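Removals:
#Similar logic as above but flipped: look for rows of df whose key values are absent from df2.
#Removals
df[~df['ID'].isin(df2['ID']) &
~df['ID2'].isin(df2['ID2']) &
~df['DO'].isin(df2['DO'])]
ID ID2 DO DATA
2 478 CUSER 8 ORIG
Edit: new dataframes. For changes, we proceed in exactly the same way as before. For inserts/removals, we use the same groupby approach as above, except that we query for the groups that appear only once. We then inner join the result with df and df2 to see what was added/removed.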
df = pd.DataFrame({'ID':[546,107,478,546,478], 'ID2':['AUSER','BUSER','CUSER','AUSER','EUSER'], 'DO':[3,6,8,4,6], 'DATA':['ORIG','ORIG','ORIG','ORIG','ORIG']})
df2 = pd.DataFrame({'ID':[107,546,123,546,123], 'ID2':['BUSER','AUSER','DUSER','AUSER','FUSER'], 'DO':[6,3,2,4,3], 'DATA':['CHANGE','CHANGE','CHANGE','ORIG','CHANGE']})
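The groupby-based inserts/removals described in the edit could be sketched roughly as follows. This is one reading of that description rather than the answerer's exact code, and it assumes the key is unique within each frame. `keys`, `diff`, `sizes`, and `once` are illustrative names.

```python
import pandas as pd

keys = ['ID', 'ID2', 'DO']  # composite key named in the question

df = pd.DataFrame({'ID': [546, 107, 478, 546, 478],
                   'ID2': ['AUSER', 'BUSER', 'CUSER', 'AUSER', 'EUSER'],
                   'DO': [3, 6, 8, 4, 6],
                   'DATA': ['ORIG', 'ORIG', 'ORIG', 'ORIG', 'ORIG']})
df2 = pd.DataFrame({'ID': [107, 546, 123, 546, 123],
                    'ID2': ['BUSER', 'AUSER', 'DUSER', 'AUSER', 'FUSER'],
                    'DO': [6, 3, 2, 4, 3],
                    'DATA': ['CHANGE', 'CHANGE', 'CHANGE', 'ORIG', 'CHANGE']})

# Drop rows that are identical in both frames (unchanged records).
diff = pd.concat([df, df2]).drop_duplicates(keep=False)

# Count how often each key survives: 2 means changed, 1 means it exists
# in only one of the two frames.
sizes = diff.groupby(keys).size().reset_index(name='n')
once = sizes.loc[sizes['n'] == 1, keys]

# Inner joins tell us which side each one-off key came from.
inserts = df2.merge(once, on=keys, how='inner')   # only in df2
removals = df.merge(once, on=keys, how='inner')   # only in df

print(inserts)
print(removals)
```

On this data the one-off keys split into the DUSER and FUSER records (inserts) and the CUSER and EUSER records (removals), matching the results shown in the question.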