Creating a "diff" result set of two pandas DataFrames with grouped key columns

Time: 2019-04-15 18:34:02

Tags: python pandas dataframe

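I'll try to keep this short and to the point (with simplified data). I have a data table with four columns (note that more columns may be added later). None of them is unique on its own, but the three columns "ID", "ID2", "DO" must be unique as a group. I load that table into one dataframe and an updated version of the table into another dataframe.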

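If df is the "original data" and df2 is the "updated data", is this the most accurate/efficient way to find what changed relative to the original data?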

import pandas as pd
#Sample Data:
df  = pd.DataFrame({'ID':[546,107,478,546,478], 'ID2':['AUSER','BUSER','CUSER','AUSER','EUSER'], 'DO':[3,6,8,4,6], 'DATA':['ORIG','ORIG','ORIG','ORIG','ORIG']})
df2 = pd.DataFrame({'ID':[107,546,123,546,123], 'ID2':['BUSER','AUSER','DUSER','AUSER','FUSER'], 'DO':[6,3,2,4,3], 'DATA':['CHANGE','CHANGE','CHANGE','ORIG','CHANGE']})
>>> df
   DATA  DO   ID    ID2
0  ORIG   3  546  AUSER
1  ORIG   6  107  BUSER
2  ORIG   8  478  CUSER
3  ORIG   4  546  AUSER
4  ORIG   6  478  EUSER

>>> df2
     DATA  DO   ID    ID2
0  CHANGE   6  107  BUSER
1  CHANGE   3  546  AUSER
2  CHANGE   2  123  DUSER
3    ORIG   4  546  AUSER
4  CHANGE   3  123  FUSER

#Compare Dataframes
merged = df2.merge(df, indicator=True, how='outer')

#Split the merged comparison into:
# - original records that will be updated or deleted 
# - new records that will be inserted or update the original record.
df_original = merged.loc[merged['_merge'] == 'right_only'].drop(columns=['_merge']).copy()
df_new = merged.loc[merged['_merge'] == 'left_only'].drop(columns=['_merge']).copy()

#Create another merge to determine if the new records will either be updates or inserts
check = pd.merge(df_new,df_original, how='left', left_on=['ID','ID2','DO'], right_on = ['ID','ID2','DO'], indicator=True)
in_temp  = check[['ID','ID2','DO']].loc[check['_merge']=='left_only']
upd_temp = check[['ID','ID2','DO']].loc[check['_merge']=='both']

#Create dataframes for each Transaction:
# - removals: Remove records based on provided key values
# - updates:  Update entire record based on key values
# - inserts:  Insert entire record
removals = pd.concat([df_original[['ID','ID2','DO']],df_new[['ID','ID2','DO']],df_new[['ID','ID2','DO']]]).drop_duplicates(keep=False)
updates  = df2.loc[(df2['ID'].isin(upd_temp['ID']))&(df2['ID2'].isin(upd_temp['ID2']))&(df2['DO'].isin(upd_temp['DO']))].copy()
inserts  = df2.loc[(df2['ID'].isin(in_temp['ID']))&(df2['ID2'].isin(in_temp['ID2']))&(df2['DO'].isin(in_temp['DO']))].copy()

Results:

>>> removals
    ID    ID2  DO
6  478  CUSER   8
8  478  EUSER   6

>>> updates
     DATA  DO   ID    ID2
0  CHANGE   6  107  BUSER
1  CHANGE   3  546  AUSER

>>> inserts
     DATA  DO   ID    ID2
2  CHANGE   2  123  DUSER
4  CHANGE   3  123  FUSER

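To restate the question: will this logic consistently and correctly identify the differences between two dataframes with the specified key columns? Is there a more efficient or more pythonic way to do it?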

Updated the sample data to include more records, along with the corresponding results.

1 answer:

Answer 0 (score: 0)

The approach below handles changes, inserts, and removals separately, starting from this sample data:

import numpy as np
import pandas as pd
#Sample Data:
df  = pd.DataFrame({'ID':[546,107,478,546], 'ID2':['AUSER','BUSER','CUSER','AUSER'], 'DO':[3,6,8,4], 'DATA':['ORIG','ORIG','ORIG','ORIG']})
df2 = pd.DataFrame({'ID':[107,546,123,546], 'ID2':['BUSER','AUSER','DUSER','AUSER'], 'DO':[6,3,2,4], 'DATA':['CHANGE','CHANGE','CHANGE','ORIG']})

For changes:

#Concatenate df and df2, then drop every row that appears in both frames (identical rows are unchanged)
df3 =  pd.concat([df, df2]).drop_duplicates(keep = False)

#After that, a key group of size 2 means the key exists in both frames with different DATA.
#Change
df3 = df3.groupby(['ID', 'ID2', 'DO'])['DATA']\
         .size()\
         .reset_index()\
         .query('DATA == 2')

#Replace the group size with a 'CHANGE' label (assign avoids writing a string into an integer column in place)
df3 = df3.assign(DATA='CHANGE')

     ID  ID2    DO    DATA
0   107 BUSER    6   CHANGE
3   546 AUSER    3   CHANGE
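
As a cross-check, the same changes can probably be found by joining the two frames on the key columns and comparing the non-key column directly. This is only a sketch: it assumes the keys are unique within each frame (as the question states) and that DATA is the only non-key column. The names keys, both, and changes are introduced here for illustration.

```python
import pandas as pd

# Four-row sample data from the answer above
df  = pd.DataFrame({'ID':[546,107,478,546], 'ID2':['AUSER','BUSER','CUSER','AUSER'], 'DO':[3,6,8,4], 'DATA':['ORIG','ORIG','ORIG','ORIG']})
df2 = pd.DataFrame({'ID':[107,546,123,546], 'ID2':['BUSER','AUSER','DUSER','AUSER'], 'DO':[6,3,2,4], 'DATA':['CHANGE','CHANGE','CHANGE','ORIG']})

keys = ['ID', 'ID2', 'DO']  # assumed composite key

# Inner join on the key: only keys present in both frames survive
both = df.merge(df2, on=keys, suffixes=('_old', '_new'))

# A change is a shared key whose DATA value differs between the frames
changes = both.loc[both['DATA_old'] != both['DATA_new'], keys + ['DATA_new']]
print(changes)
```

If more non-key columns are added later, the comparison would need to cover each of them, not just DATA.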

For inserts:

#We can compare the key columns of df and df2 and see what's new in df2

#Inserts
df2[(np.logical_not(df2['ID'].isin(df['ID'])))&
    (np.logical_not(df2['ID2'].isin(df['ID2'])))&
    (np.logical_not(df2['DO'].isin(df['DO'])))]

     ID  ID2    DO   DATA
2   123 DUSER   2   CHANGE
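
One caveat with the column-wise isin test above: it checks each column independently, so a new key that reuses any single value already present in df is missed. A sketch of a key-aware alternative is an anti-join built with merge and indicator=True. It is shown on the five-row data from the question, where the new key (123, FUSER, 3) reuses the DO value 3; the names keys, marked, and colwise are illustrative.

```python
import pandas as pd

# Five-row sample data from the question
df  = pd.DataFrame({'ID':[546,107,478,546,478], 'ID2':['AUSER','BUSER','CUSER','AUSER','EUSER'], 'DO':[3,6,8,4,6], 'DATA':['ORIG','ORIG','ORIG','ORIG','ORIG']})
df2 = pd.DataFrame({'ID':[107,546,123,546,123], 'ID2':['BUSER','AUSER','DUSER','AUSER','FUSER'], 'DO':[6,3,2,4,3], 'DATA':['CHANGE','CHANGE','CHANGE','ORIG','CHANGE']})

keys = ['ID', 'ID2', 'DO']  # assumed composite key

# Left-join df2 against the keys of df; '_merge' records where each key was found
marked = df2.merge(df[keys], on=keys, how='left', indicator=True)

# Keys found only in df2 are inserts
inserts = marked.loc[marked['_merge'] == 'left_only'].drop(columns='_merge')

# For comparison: the column-wise test, which drops the FUSER row
colwise = df2[~df2['ID'].isin(df['ID']) & ~df2['ID2'].isin(df['ID2']) & ~df2['DO'].isin(df['DO'])]

print(inserts)
print(colwise)
```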

For removals:

#Similar logic as above but flipped.

#Removals
df[(np.logical_not(df['ID'].isin(df2['ID'])))&
   (np.logical_not(df['ID2'].isin(df2['ID2'])))&
   (np.logical_not(df['DO'].isin(df2['DO'])))]

     ID  ID2    DO  DATA
2   478 CUSER   8   ORIG
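
Removals can also be computed on whole key tuples rather than column by column, which avoids missing a removed key that shares a single value with df2. Below is a sketch using MultiIndex.from_frame and MultiIndex.isin on the five-row data from the question; the names keys, old_keys, and new_keys are illustrative.

```python
import pandas as pd

# Five-row sample data from the question
df  = pd.DataFrame({'ID':[546,107,478,546,478], 'ID2':['AUSER','BUSER','CUSER','AUSER','EUSER'], 'DO':[3,6,8,4,6], 'DATA':['ORIG','ORIG','ORIG','ORIG','ORIG']})
df2 = pd.DataFrame({'ID':[107,546,123,546,123], 'ID2':['BUSER','AUSER','DUSER','AUSER','FUSER'], 'DO':[6,3,2,4,3], 'DATA':['CHANGE','CHANGE','CHANGE','ORIG','CHANGE']})

keys = ['ID', 'ID2', 'DO']  # assumed composite key

# Build one index entry per (ID, ID2, DO) tuple in each frame
old_keys = pd.MultiIndex.from_frame(df[keys])
new_keys = pd.MultiIndex.from_frame(df2[keys])

# Rows of df whose full key tuple no longer exists in df2
removals = df[~old_keys.isin(new_keys)]
print(removals)
```

Here the key (478, EUSER, 6) is reported as removed even though the DO value 6 still occurs in df2.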

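Edit: for the new dataframes, changes are handled in exactly the same way as above. For inserts/removals, we use the same group-by approach, except that we query for key groups that appear only once. We then follow up with an inner join against both df and df2 to see what was added/removed.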

df  = pd.DataFrame({'ID':[546,107,478,546,478], 'ID2':['AUSER','BUSER','CUSER','AUSER','EUSER'], 'DO':[3,6,8,4,6], 'DATA':['ORIG','ORIG','ORIG','ORIG','ORIG']})
df2 = pd.DataFrame({'ID':[107,546,123,546,123], 'ID2':['BUSER','AUSER','DUSER','AUSER','FUSER'], 'DO':[6,3,2,4,3], 'DATA':['CHANGE','CHANGE','CHANGE','ORIG','CHANGE']})
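
A minimal sketch of the inserts/removals approach just described, under the assumption that "appear only once" means a key group of size 1 after identical rows have been dropped. The names keys, combined, once, removals, and inserts are illustrative and not taken from the original answer.

```python
import pandas as pd

# Five-row sample data, as defined above
df  = pd.DataFrame({'ID':[546,107,478,546,478], 'ID2':['AUSER','BUSER','CUSER','AUSER','EUSER'], 'DO':[3,6,8,4,6], 'DATA':['ORIG','ORIG','ORIG','ORIG','ORIG']})
df2 = pd.DataFrame({'ID':[107,546,123,546,123], 'ID2':['BUSER','AUSER','DUSER','AUSER','FUSER'], 'DO':[6,3,2,4,3], 'DATA':['CHANGE','CHANGE','CHANGE','ORIG','CHANGE']})

keys = ['ID', 'ID2', 'DO']  # assumed composite key

# Drop rows that are identical in both frames, then count rows per key
combined = pd.concat([df, df2]).drop_duplicates(keep=False)
sizes = combined.groupby(keys).size().reset_index(name='n')

# A key that appears once exists in only one of the two frames
once = sizes.loc[sizes['n'] == 1, keys]

# Inner joins tell us which side each single key came from
removals = once.merge(df, on=keys)   # present only in the original data
inserts  = once.merge(df2, on=keys)  # present only in the updated data

print(removals)
print(inserts)
```

On this data the sketch yields the CUSER and EUSER rows as removals and the DUSER and FUSER rows as inserts, which matches the results listed in the question.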