Conditionally extracting Pandas rows based on another Pandas dataframe

Time: 2016-09-13 19:09:22

Tags: python pandas indexing dataframe conditional-statements

I have two dataframes:

df1:

col1    col2
1       2
1       3
2       4

df2:

col1
2
3

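I want to extract all rows of df1 where df1's col2 is not in df2's col1. So in this case it would be:

col1    col2
2       4

I first tried:

df1[df1['col2'] not in df2['col1']]

but it returned:

TypeError: 'Series' objects are mutable, thus they cannot be hashed

I then tried a variant of that expression, but it returned:

TypeError: argument of type 'instancemethod' is not iterable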

2 answers:

Answer 0 (score: 1)

You can use isin together with ~ to invert the boolean mask:

print (df1['col2'].isin(df2['col1']))
0     True
1     True
2    False
Name: col2, dtype: bool

print (~df1['col2'].isin(df2['col1']))
0    False
1    False
2     True
Name: col2, dtype: bool

print (df1[~df1['col2'].isin(df2['col1'])])
   col1  col2
2     2     4
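
For anyone who wants to run the steps above end to end, here is a minimal self-contained sketch. It rebuilds df1 and df2 from the values shown in the question; the names mask and result are only illustrative and do not come from the original post.

```python
import pandas as pd

# Sample frames rebuilt from the values in the question
df1 = pd.DataFrame({"col1": [1, 1, 2], "col2": [2, 3, 4]})
df2 = pd.DataFrame({"col1": [2, 3]})

# True for each row of df1 whose col2 value occurs in df2's col1
mask = df1["col2"].isin(df2["col1"])

# ~ flips the mask element-wise, keeping only the rows that do NOT match
result = df1[~mask]
print(result)
```

Unlike Python's `not in` operator, which yields a single True/False for the whole object, isin compares element by element and returns a boolean Series that can be used directly as a row filter.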

Timings:

In [8]: %timeit (df1.query('col2 not in @df2.col1'))
1000 loops, best of 3: 1.57 ms per loop

In [9]: %timeit (df1[~df1['col2'].isin(df2['col1'])])
1000 loops, best of 3: 466 µs per loop

Answer 1 (score: 1)

Use the .query() method:

In [9]: df1.query('col2 not in @df2.col1')
Out[9]:
   col1  col2
2     2     4
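
A runnable version of the same idea, as a sketch that assumes the sample data from the question: inside the query string, the @ prefix refers to a variable in the calling Python scope, while bare names such as col2 refer to columns of df1. The variable names result and expected are illustrative only.

```python
import pandas as pd

# Sample frames rebuilt from the values in the question
df1 = pd.DataFrame({"col1": [1, 1, 2], "col2": [2, 3, 4]})
df2 = pd.DataFrame({"col1": [2, 3]})

# "col2" is a column of df1; "@df2.col1" looks up df2 in the local scope
result = df1.query("col2 not in @df2.col1")
print(result)

# The isin-based filter should select the same rows
expected = df1[~df1["col2"].isin(df2["col1"])]
print(result.equals(expected))
```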

Timings for larger DataFrames:

In [44]: df1.shape
Out[44]: (30000000, 2)

In [45]: df2.shape
Out[45]: (20000000, 1)

In [46]: %timeit (df1[~df1['col2'].isin(df2['col1'])])
1 loop, best of 3: 5.56 s per loop

In [47]: %timeit (df1.query('col2 not in @df2.col1'))
1 loop, best of 3: 5.96 s per loop