I have the following 2 dataframes:
DF1:
DATE ID_1 ID_2 RESULT
0 2014-06-16 1 a RED
1 2014-07-01 1 a WHITE
2 2014-08-16 2 c BLUE
3 2015-08-16 3 a RED
DF2:
DATE ID_1 ID_2 RESULT
0 2014-06-16 1 z WHITE
1 2014-07-01 1 z WHITE
2 2014-08-16 2 h BLUE
3 2014-08-16 3 k RED
You can reproduce them by running:
import pandas as pd
df1 = pd.DataFrame(columns=["DATE","ID_1", "ID_2", "RESULT" ])
df2 = pd.DataFrame(columns=["DATE","ID_1", "ID_2","RESULT"])
df1["DATE"] = ['2014-06-16', '2014-07-01', '2014-08-16', '2015-08-16']
df1['ID_1'] = [1,1,2,3]
df1['ID_2'] = ['a', 'a', 'c', 'a']
df1['RESULT'] = ['RED', 'WHITE', 'BLUE', 'RED']
df2["DATE"] = ['2014-06-16', '2014-07-01', '2014-08-16' , '2014-08-16']
df2['ID_1'] = [1,1,2,3]
df2['ID_2'] = ['z', 'z', 'h', 'k']
df2['RESULT'] = ['WHITE', 'WHITE', 'BLUE', 'RED']
Now I need to group both by "ID_1" and check whether all columns (except ID_2) are equal, ideally by showing the differences.
The result should be:
DATE ID_1 ID_2x ID_2y RESULTx RESULTy
2014-06-16 1 z a WHITE RED
I tried grouping as follows:
grp1 = df1.groupby("ID_1")
grp2 = df2.groupby("ID_1")
for (g1,g2) in zip(grp1,grp2):
    g1[1][["DATE", "RESULT"]] != g2[1][["DATE", "RESULT"]]
But I don't think this is efficient. In addition, I get a comparison error:
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()
Any ideas on how to proceed?
Thanks!
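Answer 0: (score: 1)
To restate the problem: you want to compare the two dataframes and find all rows whose values differ (ignoring a specific column). Here is one way, which compares rows by position and therefore assumes both dataframes share the same index: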
cols = ['DATE', 'ID_1', 'RESULT']
cond = (df1[cols] != df2[cols]).any(axis=1)
new_df = df1[cond].merge(df2[cond], on='ID_1', how='outer', suffixes=('x','y'))
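(The result differs slightly from the one in your question, because I am not entirely sure about the general behavior you are looking for; see my comment on the question.)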
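If the rows are meant to be paired by DATE and ID_1 rather than by position, a key-based merge may get closer to the output shown in the question. The following is only a sketch under that assumption. The names `paired` and `diff` are made up for illustration, and putting `df2` on the left (so that it gets the `x` suffix) is a guess based on the expected row.

```python
import pandas as pd

# Sample data from the question.
df1 = pd.DataFrame({
    "DATE": ["2014-06-16", "2014-07-01", "2014-08-16", "2015-08-16"],
    "ID_1": [1, 1, 2, 3],
    "ID_2": ["a", "a", "c", "a"],
    "RESULT": ["RED", "WHITE", "BLUE", "RED"],
})
df2 = pd.DataFrame({
    "DATE": ["2014-06-16", "2014-07-01", "2014-08-16", "2014-08-16"],
    "ID_1": [1, 1, 2, 3],
    "ID_2": ["z", "z", "h", "k"],
    "RESULT": ["WHITE", "WHITE", "BLUE", "RED"],
})

# Assumption: a row in df2 corresponds to the row in df1 with the same
# DATE and ID_1. The inner merge drops rows that have no counterpart.
paired = df2.merge(df1, on=["DATE", "ID_1"], suffixes=("x", "y"))

# Keep only the pairs whose RESULT differs; ID_2 is carried along but not compared.
diff = paired[paired["RESULTx"] != paired["RESULTy"]].reset_index(drop=True)

# Reorder the columns to match the layout in the question.
diff = diff[["DATE", "ID_1", "ID_2x", "ID_2y", "RESULTx", "RESULTy"]]
print(diff)
```

Note that with an inner merge the ID_1 == 3 rows do not appear at all, because their dates differ. Passing `how="outer"` together with `indicator=True` to `merge` would keep such unmatched rows visible if they should also count as differences.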