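I am working with two very similar dataframes, and I am trying to figure out how to get the data that is in one but not in the other, and vice versa.
Here is my code so far: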
import pandas as pd
import numpy as np
def report_diff(x):
    return x[0] if x[0] == x[1] else '{} ---> {}'.format(*x)
old = pd.read_excel('File 1')
new = pd.read_excel('File 2')
old['version'] = 'old'
new['version'] = 'new'
full_set = pd.concat([old,new],ignore_index=True)
changes = full_set.drop_duplicates(subset=['ID','Type', 'Total'], keep='last')
duplicated = changes.duplicated(subset=['ID', 'Type'], keep=False)
dupe_accts = changes[duplicated]
change_new = dupe_accts[(dupe_accts['version'] == 'new')]
change_old = dupe_accts[(dupe_accts['version'] == 'old' )]
change_new = change_new.drop(['version'], axis=1)
change_old = change_old.drop(['version'],axis=1)
change_new.set_index('Employee ID', inplace=True)
change_old.set_index('Employee ID', inplace=True)
diff_panel = pd.Panel(dict(df1=change_old,df2=change_new))
diff_output = diff_panel.apply(report_diff, axis=0)
So the next step is to get the data that is only in old and the data that is only in new.
My first attempt was:
changes['duplicate']=changes['Employee ID'].isin(dupe_accts)
removed_accounts = changes[(changes['duplicate'] == False) & (changes['version'] =='old')]
Answer 0 (score: 4)
My head is spinning looking at your code!
IIUC:
Use merge with the parameter indicator=True.
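To make the role of the indicator concrete, here is a minimal sketch. The frames left and right and the column key are made up for illustration and are not part of the question's data. With indicator=True, merge adds a column named _merge that records whether each row's key was found in the left frame only, the right frame only, or both.

```python
import pandas as pd

# Hypothetical toy frames, used only to show what the indicator column contains
left = pd.DataFrame({'key': [1, 2, 3]})
right = pd.DataFrame({'key': [2, 3, 4]})

# An outer merge keeps every key from either side; indicator=True adds
# a '_merge' column with the values 'left_only', 'right_only' or 'both'
merged = left.merge(right, how='outer', on='key', indicator=True)

left_only_keys = merged.loc[merged['_merge'] == 'left_only', 'key'].tolist()
right_only_keys = merged.loc[merged['_merge'] == 'right_only', 'key'].tolist()
print(left_only_keys, right_only_keys)
```

Filtering on _merge is then just an ordinary boolean selection or query.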
Consider the dataframes old and new
old = pd.DataFrame(dict(
    ID=[1, 2, 3, 4, 5],
    Type=list('AAABB'),
    Total=[9 for _ in range(5)],
    ArbitraryColumn=['blah' for _ in range(5)]
))
new = old.head(2)
Then merge and query for left_only
old.merge(
    new, 'outer', on=['ID', 'Type'],
    suffixes=['', '_'], indicator=True
).query('_merge == "left_only"')
  ArbitraryColumn  ID  Total Type ArbitraryColumn_  Total_     _merge
2            blah   3      9    A              NaN     NaN  left_only
3            blah   4      9    B              NaN     NaN  left_only
4            blah   5      9    B              NaN     NaN  left_only
We can reindex to restrict the result to the original columns
old.merge(
    new, 'outer', on=['ID', 'Type'],
    suffixes=['', '_'], indicator=True
).query('_merge == "left_only"').reindex(columns=old.columns)  # reindex_axis was removed in pandas 1.0
  ArbitraryColumn  ID  Total Type
2            blah   3      9    A
3            blah   4      9    B
4            blah   5      9    B
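Since the question asks for both directions (rows only in old and rows only in new), the same idea can be wrapped in a small helper. This is only a sketch under some assumptions: the helper name only_in is hypothetical, the new frame below is invented so that both directions return something, and the match keys are assumed to be ID and Type as in the merge above.

```python
import pandas as pd

old = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'Type': list('AAABB'),
    'Total': [9] * 5,
})
# Assumed example data: IDs 3-5 are missing from new, and ID 6 is new
new = pd.DataFrame({
    'ID': [1, 2, 6],
    'Type': list('AAB'),
    'Total': [9, 9, 7],
})


def only_in(left, right, keys):
    """Return the rows of `left` whose key combination is absent from `right`."""
    # Merge against the de-duplicated key columns of `right` only, so no
    # suffixed columns appear and no rows of `left` get multiplied
    merged = left.merge(
        right[keys].drop_duplicates(), how='left', on=keys, indicator=True
    )
    # Keep unmatched rows and restrict to the original columns of `left`
    return merged.loc[merged['_merge'] == 'left_only', left.columns]


keys = ['ID', 'Type']
removed = only_in(old, new, keys)  # present in old, absent from new
added = only_in(new, old, keys)    # present in new, absent from old
print(removed)
print(added)
```

Merging against only the key columns of the other frame is a design choice here: it avoids the suffixes argument and the later column restriction, at the cost of not carrying the other frame's values along for comparison.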