Pandas: find rows that do not exist in another DataFrame, matching on multiple columns

Time: 2015-09-18 13:02:04

Tags: python join pandas

Same as "python pandas: how to find rows in one dataframe but not in another?", but with multiple columns.

This is the setup:

import pandas as pd

df = pd.DataFrame(dict(
    col1=[0,1,1,2],
    col2=['a','b','c','b'],
    extra_col=['this','is','just','something']
))

other = pd.DataFrame(dict(
    col1=[1,2],
    col2=['b','c']
))

Now I want to select the rows from df that do not exist in other. The selection should be made by col1 and col2.

In SQL I would do:

select * from df 
where not exists (
    select * from other o 
    where df.col1 = o.col1 and 
    df.col2 = o.col2
)

In pandas I can get the same result, but my approach feels very ugly. Part of the ugliness could be avoided if df had an id column, but one is not always available.

So maybe there is a more elegant way?

2 answers:

Answer 0: (score: 24)

Since pandas 0.17.0, merge accepts an indicator parameter that tells you whether each row is present only in the left frame, only in the right frame, or in both:

In [5]:
merged = df.merge(other, how='left', indicator=True)
merged

Out[5]:
   col1 col2  extra_col     _merge
0     0    a       this  left_only
1     1    b         is       both
2     1    c       just  left_only
3     2    b  something  left_only

In [6]:    
merged[merged['_merge']=='left_only']

Out[6]:
   col1 col2  extra_col     _merge
0     0    a       this  left_only
2     1    c       just  left_only
3     2    b  something  left_only

So you can now filter the merged df by selecting only the 'left_only' rows.
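To make this reusable, the merge-and-filter steps can be wrapped in a small helper. This is only a sketch: the name anti_join is made up for illustration, and it assumes a pandas version that supports both indicator= and drop(columns=...). It merges against the de-duplicated key columns of the right frame only, so repeated keys in other cannot multiply rows of df and no extra columns leak in.

```python
import pandas as pd


def anti_join(left, right, on):
    """Return the rows of `left` whose `on` values have no match in `right`.

    Hypothetical helper, not part of pandas.
    """
    # Use only the key columns of `right`, de-duplicated, so the left
    # merge keeps exactly one output row per row of `left`.
    keys = right[on].drop_duplicates()
    merged = left.merge(keys, how="left", on=on, indicator=True)
    # Rows marked 'left_only' found no partner in `right`.
    only_left = merged.loc[merged["_merge"] == "left_only"]
    return only_left.drop(columns="_merge")


df = pd.DataFrame(dict(
    col1=[0, 1, 1, 2],
    col2=['a', 'b', 'c', 'b'],
    extra_col=['this', 'is', 'just', 'something'],
))
other = pd.DataFrame(dict(col1=[1, 2], col2=['b', 'c']))

result = anti_join(df, other, on=['col1', 'col2'])
print(result)
```

Note that merge builds a fresh integer index for its result, so the index labels of the returned rows are positions in the merged frame rather than the original labels of df.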

Answer 1: (score: 4)

Interesting.

cols = ['col1','col2']
# Get copies whose index is built from the columns of interest
df2 = df.set_index(cols)
other2 = other.set_index(cols)
# Keep the rows of df whose (col1, col2) index is not in other's index; ~ negates the mask
df[~df2.index.isin(other2.index)]

Returns:

    col1 col2  extra_col
0     0    a       this
2     1    c       just
3     2    b  something

Seems a bit more elegant...
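A possible variant of the same idea skips the intermediate set_index copies and builds the keys directly. This is a sketch that assumes pandas 0.24 or newer, where pd.MultiIndex.from_frame is available. The names left_keys, right_keys and not_in_other are just illustrative.

```python
import pandas as pd

df = pd.DataFrame(dict(
    col1=[0, 1, 1, 2],
    col2=['a', 'b', 'c', 'b'],
    extra_col=['this', 'is', 'just', 'something'],
))
other = pd.DataFrame(dict(col1=[1, 2], col2=['b', 'c']))

cols = ['col1', 'col2']

# Build one (col1, col2) tuple key per row for each frame.
left_keys = pd.MultiIndex.from_frame(df[cols])
right_keys = pd.MultiIndex.from_frame(other[cols])

# Boolean array: True where a row's key also appears in `other`.
mask = left_keys.isin(right_keys)

# Negate the mask to keep only the rows missing from `other`.
not_in_other = df[~mask]
print(not_in_other)
```

Because the mask is applied to df itself, the original index labels of df are preserved in the result.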