Get unmerged data from the source tables

Asked: 2017-11-07 13:58:28

Tags: python pandas dataframe merge

I have two DataFrames:

df1 = pd.DataFrame(data = {'Invoice' : [1, 2, 3, 4, 5],
                            'Value' : [10, 25, 40, 10, 15]})

     Invoice    Value  Param
0        1     10      0
1        2     25      0
2        3     40      0
3        4     10      0
4        5     15      0

df2 = pd.DataFrame(data = {'Invoice' : [2, 3, 5, 2],
                'Value' : [25, 15, 15,25],
                'TestData': ["A",'B','C','D']})

   Invoice    TestData  Value  Param
0        2        A     25      0
1        3        B     15      0
2        5        C     15      0
3        2        D     25      1

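Invoice 2 occurs twice in df2 but only once in df1, and I don't want that single df1 row to be merged twice, so I number the repeats: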

df1["Param"]=df1.groupby(["Invoice","Value"]).cumcount()
df2["Param"]=df2.groupby(["Invoice","Value"]).cumcount()

Then I merge:

df3 = pd.merge(df1,df2, left_on=["Invoice","Value","Param"], right_on=["Invoice","Value","Param"])

This gives the final merged DataFrame:

   Invoice  Value  Param TestData
0        2     25      0        A
1        5     15      0        C

Now I want to get the unmerged data from df1:

df1[(~df1.Invoice.isin(df3.Invoice))&(~df1.Value.isin(df3.Value))]

It works for df1:

    Invoice Value   Param
0   1     10      0
2   3     40      0
3   4     10      0

But it fails for df2, where the result is an empty DataFrame:

df2[(~df2.Value.isin(df3.Value))&(~df2.Invoice.isin(df3.Invoice))]

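As far as I can tell, the comparison runs "twice" rather than once, despite the & operator: the code first checks every invoice number against df3, then checks every value independently, instead of testing both conditions against the same df3 row.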
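The suspicion above can be checked directly. The following is a minimal sketch, not part of the original post: it evaluates the two `isin` masks separately for df2. Here `df3` is typed in by hand from the merged result shown earlier (an assumption made only to keep the snippet self-contained), and the names `invoice_unseen` and `value_unseen` are hypothetical.

```python
import pandas as pd

df2 = pd.DataFrame({'Invoice': [2, 3, 5, 2],
                    'Value': [25, 15, 15, 25],
                    'TestData': ['A', 'B', 'C', 'D']})
df2['Param'] = df2.groupby(['Invoice', 'Value']).cumcount()

# Assumption: the inner-merge result from the question, entered by hand.
df3 = pd.DataFrame({'Invoice': [2, 5], 'Value': [25, 15],
                    'Param': [0, 0], 'TestData': ['A', 'C']})

# Each mask looks at one column only; neither knows about the other column.
invoice_unseen = ~df2['Invoice'].isin(df3['Invoice'])
value_unseen = ~df2['Value'].isin(df3['Value'])

print(invoice_unseen.tolist())  # [False, True, False, False]
print(value_unseen.tolist())    # [False, False, False, False]

# Every value in df2 (25 or 15) appears somewhere in df3.Value, so the
# combined mask is False for every row and the selection is empty.
print(df2[invoice_unseen & value_unseen].empty)  # True
```

Row B (invoice 3, value 15) is lost because 15 occurs in df3 for a different invoice, and row D is lost because neither mask takes Param into account.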

Do you know how to get the unmerged rows of df2 in this shape:

         Invoice   TestData   Value   Param
1        3        B     15      0
3        2        D     25      1

1 Answer:

Answer 0: (score: 2)

Updated to add "Param" to the merge

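One way to find the "unmerged" data from both DataFrames is to merge with how='outer' and indicator=True. The result has a _merge column that takes one of three values: 'both' means the row merged successfully, 'left_only' marks "unmerged" data from df1, and 'right_only' marks "unmerged" data from df2.

Example: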

import pandas as pd

df1 = pd.DataFrame(data = {'Invoice' : [1, 2, 3, 4, 5],
                            'Value' : [10, 25, 40, 10, 15]})

df2 = pd.DataFrame(data = {'Invoice' : [2, 3, 5, 2],
                'Value' : [25, 15, 15,25],
                'TestData': ["A",'B','C','D']})

# Number repeated (Invoice, Value) pairs 0, 1, 2, ... so that each
# duplicate can match at most one row on the other side.
df1["Param"]=df1.groupby(["Invoice","Value"]).cumcount()
df2["Param"]=df2.groupby(["Invoice","Value"]).cumcount()

# Outer merge keeps every row; indicator=True adds the _merge column.
df3 = pd.merge(df1,df2, left_on=["Invoice","Value","Param"], 
               right_on=["Invoice","Value","Param"],
               how='outer', indicator=True)

df3

Output:

   Invoice  Value  Param TestData      _merge
0        1     10      0      NaN   left_only
1        2     25      0        A        both
2        3     40      0      NaN   left_only
3        4     10      0      NaN   left_only
4        5     15      0        C        both
5        3     15      0        B  right_only
6        2     25      1        D  right_only

To get the fully merged data (the equivalent of an inner join):

df3.query('_merge == "both"')

Output:

   Invoice  Value  Param TestData _merge
1        2     25      0        A   both
4        5     15      0        C   both

To get the "unmerged" data from df1:

df3.query('_merge == "left_only"')

   Invoice  Value  Param TestData     _merge
0        1     10      0      NaN  left_only
2        3     40      0      NaN  left_only
3        4     10      0      NaN  left_only

And to get the "unmerged" data from df2:

df3.query('_merge == "right_only"')

   Invoice  Value  Param TestData      _merge
5        3     15      0        B  right_only
6        2     25      1        D  right_only
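The question asked for the unmerged df2 rows with df2's own columns. A possible way to get there from the indicator merge is sketched below; it is an illustration rather than part of the original answer, and the names `keys`, `right_only`, `merged_keys` and `unmerged_df2` are hypothetical. Option 1 simply drops `_merge` and reorders the columns, so its row labels come from df3. Option 2 compares whole key tuples against the merged rows, which keeps df2's original row labels.

```python
import pandas as pd

df1 = pd.DataFrame({'Invoice': [1, 2, 3, 4, 5],
                    'Value': [10, 25, 40, 10, 15]})
df2 = pd.DataFrame({'Invoice': [2, 3, 5, 2],
                    'Value': [25, 15, 15, 25],
                    'TestData': ['A', 'B', 'C', 'D']})
df1['Param'] = df1.groupby(['Invoice', 'Value']).cumcount()
df2['Param'] = df2.groupby(['Invoice', 'Value']).cumcount()

keys = ['Invoice', 'Value', 'Param']
df3 = pd.merge(df1, df2, on=keys, how='outer', indicator=True)

# Option 1: filter the merge result, then restore df2's column layout.
right_only = (df3.query('_merge == "right_only"')
                 .drop(columns='_merge')
                 [['Invoice', 'TestData', 'Value', 'Param']])

# Option 2: test each df2 row's (Invoice, Value, Param) tuple as a unit
# against the tuples that merged, and index df2 itself with the result.
merged_keys = pd.MultiIndex.from_frame(df3.loc[df3['_merge'] == 'both', keys])
row_merged = pd.MultiIndex.from_frame(df2[keys]).isin(merged_keys)
unmerged_df2 = df2.loc[~row_merged, ['Invoice', 'TestData', 'Value', 'Param']]

print(unmerged_df2)
```

Option 2 is also the row-wise version of the `isin` filter from the question: because the three key columns are compared together, a value that matches under a different invoice no longer counts as a match.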