I have two DataFrames:
df1:
col1 col2
1 2
1 3
2 4
df2:
col1
2
3
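I want to extract all rows of df1 whose col2 value is not in col1 of df2. So in this case it would be:
col1 col2
2 4
I first tried:
df1[df1['col2'] not in df2['col1']]
But it returned:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
I then tried a variation of that expression, but it returned:
TypeError: argument of type 'instancemethod' is not iterable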
Answer 0 (score: 1)
You can use isin together with ~ to invert the boolean mask:
print (df1['col2'].isin(df2['col1']))
0 True
1 True
2 False
Name: col2, dtype: bool
print (~df1['col2'].isin(df2['col1']))
0 False
1 False
2 True
Name: col2, dtype: bool
print (df1[~df1['col2'].isin(df2['col1'])])
col1 col2
2 2 4
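For anyone who wants to run the steps above end to end, here is a minimal self-contained sketch. The frames are rebuilt from the tables in the question; the variable names `mask` and `result` are only illustrative and do not come from the original post.

```python
import pandas as pd

# Rebuild the two small frames shown in the question.
df1 = pd.DataFrame({"col1": [1, 1, 2], "col2": [2, 3, 4]})
df2 = pd.DataFrame({"col1": [2, 3]})

# Element-wise membership test: True where df1.col2 occurs in df2.col1.
mask = df1["col2"].isin(df2["col1"])

# ~ flips the boolean mask, keeping rows whose col2 is absent from df2.col1.
result = df1[~mask]
print(result)
```

Unlike Python's `not in` operator, `isin` returns one boolean per row, which is what boolean indexing needs.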
Timings:
In [8]: %timeit (df1.query('col2 not in @df2.col1'))
1000 loops, best of 3: 1.57 ms per loop
In [9]: %timeit (df1[~df1['col2'].isin(df2['col1'])])
1000 loops, best of 3: 466 µs per loop
Answer 1 (score: 1)
Use the .query() method:
In [9]: df1.query('col2 not in @df2.col1')
Out[9]:
col1 col2
2 2 4
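A runnable version of the same call, as a sketch with the frames rebuilt from the question. Inside the query string, the `@` prefix refers to a Python variable in the calling scope, here `df2`. The name `result` is illustrative.

```python
import pandas as pd

# Rebuild the two small frames shown in the question.
df1 = pd.DataFrame({"col1": [1, 1, 2], "col2": [2, 3, 4]})
df2 = pd.DataFrame({"col1": [2, 3]})

# "col2" names a column of df1; "@df2.col1" looks up the df2 variable
# from the surrounding Python scope and takes its col1 column.
result = df1.query("col2 not in @df2.col1")
print(result)
```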
Timings for larger DataFrames:
In [44]: df1.shape
Out[44]: (30000000, 2)
In [45]: df2.shape
Out[45]: (20000000, 1)
In [46]: %timeit (df1[~df1['col2'].isin(df2['col1'])])
1 loop, best of 3: 5.56 s per loop
In [47]: %timeit (df1.query('col2 not in @df2.col1'))
1 loop, best of 3: 5.96 s per loop
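To repeat this comparison outside IPython, something like the following sketch could be used. The data here is synthetic and much smaller than the frames timed above; the sizes, value range, random seed, and helper names `with_isin` and `with_query` are all assumptions made for illustration, and the numbers it prints will depend on your machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data; sizes and value range are arbitrary assumptions.
rng = np.random.default_rng(0)
n1, n2 = 300_000, 200_000
df1 = pd.DataFrame({
    "col1": rng.integers(0, 1_000_000, n1),
    "col2": rng.integers(0, 1_000_000, n1),
})
df2 = pd.DataFrame({"col1": rng.integers(0, 1_000_000, n2)})


def with_isin(df1, df2):
    # Boolean mask inverted with ~, as in the first answer.
    return df1[~df1["col2"].isin(df2["col1"])]


def with_query(df1, df2):
    # Same filter as a query string; @df2 resolves to the local argument.
    return df1.query("col2 not in @df2.col1")


for fn in (with_isin, with_query):
    # Best of three single runs, similar in spirit to %timeit.
    seconds = min(timeit.repeat(lambda: fn(df1, df2), number=1, repeat=3))
    print(f"{fn.__name__}: {seconds:.3f} s")
```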