I have two data frames:
df1 = pd.DataFrame(data = {'Invoice' : [1, 2, 3, 4, 5],
                           'Value' : [10, 25, 40, 10, 15]})
Invoice Value Param
0 1 10 0
1 2 25 0
2 3 40 0
3 4 10 0
4 5 15 0
df2 = pd.DataFrame(data = {'Invoice' : [2, 3, 5, 2],
                           'Value' : [25, 15, 15, 25],
                           'TestData': ["A", 'B', 'C', 'D']})
Invoice TestData Value Param
0 2 A 25 0
1 3 B 15 0
2 5 C 15 0
3 2 D 25 1
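I don't want invoice 2 from df2 to be merged twice when it appears only once in df1, so I number the repeated (Invoice, Value) pairs in each frame: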
df1["Param"]=df1.groupby(["Invoice","Value"]).cumcount()
df2["Param"]=df2.groupby(["Invoice","Value"]).cumcount()
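To illustrate why the cumcount() counter is needed, here is a minimal sketch that compares a merge on Invoice and Value alone with a merge that also uses Param. The frames are the ones from the question. The names `plain` and `numbered` are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"Invoice": [1, 2, 3, 4, 5],
                    "Value": [10, 25, 40, 10, 15]})
df2 = pd.DataFrame({"Invoice": [2, 3, 5, 2],
                    "Value": [25, 15, 15, 25],
                    "TestData": ["A", "B", "C", "D"]})

# Without a counter, the single (2, 25) row in df1 pairs with both
# (2, 25) rows in df2, so invoice 2 shows up twice in the result.
plain = pd.merge(df1, df2, on=["Invoice", "Value"])

# cumcount() numbers repeated (Invoice, Value) pairs 0, 1, 2, ...
# within each frame, so the n-th duplicate can only match the n-th duplicate.
df1["Param"] = df1.groupby(["Invoice", "Value"]).cumcount()
df2["Param"] = df2.groupby(["Invoice", "Value"]).cumcount()
numbered = pd.merge(df1, df2, on=["Invoice", "Value", "Param"])

print(plain)
print(numbered)
```

With this data, `plain` has three rows (TestData A, D and C), while `numbered` keeps only A and C, because the second (2, 25) row in df2 gets Param 1 and has no partner in df1.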
Then I merge:
df3 = pd.merge(df1, df2, left_on=["Invoice","Value","Param"], right_on=["Invoice","Value","Param"])
which gives the final merged data frame:
Invoice Value Param TestData
0 2 25 0 A
1 5 15 0 C
Now I want to get the unmerged rows from df1:
df1[(~df1.Invoice.isin(df3.Invoice))&(~df1.Value.isin(df3.Value))]
This works for df1:
Invoice Value Param
0 1 10 0
2 3 40 0
3 4 10 0
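But it fails for df2, where the result is an empty DataFrame:
df2[(~df2.Value.isin(df3.Value))&(~df2.Invoice.isin(df3.Invoice))]
As far as I can tell, the comparison runs "twice" rather than once (the & operator): the code first checks each invoice number and then checks the value independently, instead of testing both conditions together on the same row.
Do you know how to get the unmerged part of df2 in this shape: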
Invoice TestData Value Param
1 3 B 15 0
3 2 D 25 1
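The suspicion above can be checked with a small sketch. Each isin() call tests a single column against all values of that column in df3, so the two masks know nothing about which Invoice belongs to which Value. Comparing whole key tuples, here via pd.MultiIndex.from_frame, is one possible way to test the columns jointly. The variable names are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame({"Invoice": [1, 2, 3, 4, 5],
                    "Value": [10, 25, 40, 10, 15]})
df2 = pd.DataFrame({"Invoice": [2, 3, 5, 2],
                    "Value": [25, 15, 15, 25],
                    "TestData": ["A", "B", "C", "D"]})
keys = ["Invoice", "Value", "Param"]
df1["Param"] = df1.groupby(["Invoice", "Value"]).cumcount()
df2["Param"] = df2.groupby(["Invoice", "Value"]).cumcount()
df3 = pd.merge(df1, df2, on=keys)

# Column-by-column test: every Value in df2 (15 or 25) occurs somewhere
# in df3, so value_unseen is False on every row and the AND is empty.
invoice_unseen = ~df2["Invoice"].isin(df3["Invoice"])
value_unseen = ~df2["Value"].isin(df3["Value"])
independent = df2[invoice_unseen & value_unseen]

# Joint test: compare complete (Invoice, Value, Param) tuples instead.
merged_keys = pd.MultiIndex.from_frame(df3[keys])
joint_mask = ~pd.MultiIndex.from_frame(df2[keys]).isin(merged_keys)
joint = df2[joint_mask]

print(independent)
print(joint)
```

On this data `independent` is empty, while `joint` keeps the rows with TestData B and D. The same column-by-column filter appears to work for df1 only because its unmerged rows happen to have both an unseen Invoice and an unseen Value.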
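Answer 0 (score: 2)
One way to find the "unmerged" data from both data frames is to merge with how='outer' and indicator=True. This produces a data frame with a _merge column that takes three values: 'both' means the row merged successfully, 'left_only' means "unmerged" data from df1, and 'right_only' means "unmerged" data from df2.
Example: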
import pandas as pd

df1 = pd.DataFrame(data = {'Invoice' : [1, 2, 3, 4, 5],
                           'Value' : [10, 25, 40, 10, 15]})
df2 = pd.DataFrame(data = {'Invoice' : [2, 3, 5, 2],
'Value' : [25, 15, 15,25],
'TestData': ["A",'B','C','D']})
df1["Param"]=df1.groupby(["Invoice","Value"]).cumcount()
df2["Param"]=df2.groupby(["Invoice","Value"]).cumcount()
df3 = pd.merge(df1,df2, left_on=["Invoice","Value","Param"],
right_on=["Invoice","Value","Param"],
how='outer', indicator=True)
df3
Output:
Invoice Value Param TestData _merge
0 1 10 0 NaN left_only
1 2 25 0 A both
2 3 40 0 NaN left_only
3 4 10 0 NaN left_only
4 5 15 0 C both
5 3 15 0 B right_only
6 2 25 1 D right_only
To get the fully merged rows (the equivalent of the inner join):
df3.query('_merge == "both"')
Output:
Invoice Value Param TestData _merge
1 2 25 0 A both
4 5 15 0 C both
To get the "unmerged" data from df1:
df3.query('_merge == "left_only"')
Invoice Value Param TestData _merge
0 1 10 0 NaN left_only
2 3 40 0 NaN left_only
3 4 10 0 NaN left_only
And to get the "unmerged" data from df2:
df3.query('_merge == "right_only"')
Invoice Value Param TestData _merge
5 3 15 0 B right_only
6 2 25 1 D right_only
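The outer merge returns new row labels (5 and 6 above) and an extra _merge column. If the original df2 index and columns are wanted, as in the shape shown in the question, one possible variation is to use a left merge with indicator=True only to build a boolean mask. This is a sketch, not part of the original answer, and `unmatched_rows` is a hypothetical helper name.

```python
import pandas as pd

df1 = pd.DataFrame({"Invoice": [1, 2, 3, 4, 5],
                    "Value": [10, 25, 40, 10, 15]})
df2 = pd.DataFrame({"Invoice": [2, 3, 5, 2],
                    "Value": [25, 15, 15, 25],
                    "TestData": ["A", "B", "C", "D"]})
keys = ["Invoice", "Value", "Param"]
df1["Param"] = df1.groupby(["Invoice", "Value"]).cumcount()
df2["Param"] = df2.groupby(["Invoice", "Value"]).cumcount()


def unmatched_rows(left, right, keys):
    """Return the rows of `left` whose key tuple has no partner in `right`."""
    # Deduplicating the right-hand keys keeps the left merge at exactly one
    # output row per row of `left`, in the original order.
    flags = left[keys].merge(right[keys].drop_duplicates(),
                             on=keys, how="left", indicator=True)
    # to_numpy() drops the fresh merge index so the mask applies by position.
    mask = (flags["_merge"] == "left_only").to_numpy()
    return left[mask]


df2_unmerged = unmatched_rows(df2, df1, keys)
df1_unmerged = unmatched_rows(df1, df2, keys)
print(df2_unmerged)
print(df1_unmerged)
```

Here `df2_unmerged` keeps the original labels 1 and 3 (TestData B and D), and `df1_unmerged` keeps labels 0, 2 and 3, with no _merge column in either result.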