Check whether a value in a DataFrame row is not contained in a column of another DataFrame

Time: 2017-05-10 06:25:21

Tags: pandas

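I want to check, using an "if" condition, whether a value in a row of one DataFrame is not contained in a specific column of another DataFrame.

If my DataFrames are: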

df1:
   col1 col2
0   a   e
1   b   f
2   c   g
3   d   h

df2:

   col1 col2
0   a   y
1   v   u
2   x   z
3   w   t

I want to iterate over every row of col1 in df1 and check whether that value is contained in col1 of df2.

My current code is:

 for row, i in df1.iterrows():
    for row, j in df2.iterrows():
       if i.col1 not in j.col1:
          print("blu")

Right now the code enters the if condition even when the value in col1 of df1 is contained in col1 of df2.

Any help would be appreciated.

2 Answers:

Answer 0: (score: 0)

Use isin:

df1.col1.isin(df2.col1)

0     True
1    False
2    False
3    False
Name: col1, dtype: bool
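
To tie this back to the "if" in the question, here is a minimal self-contained sketch. The frames are rebuilt from the question's sample data; the names found and missing are illustrative and not from the original post. An if statement needs a single boolean, so the Series is reduced with all() first.

```python
import pandas as pd

# Sample data copied from the question
df1 = pd.DataFrame({"col1": list("abcd"), "col2": list("efgh")})
df2 = pd.DataFrame({"col1": list("avxw"), "col2": list("yuzt")})

# One membership test for every value of df1.col1
found = df1["col1"].isin(df2["col1"])

# Reduce the boolean Series to one value before using it in an if
if not found.all():
    missing = df1.loc[~found, "col1"].tolist()  # illustrative name
    print("not in df2.col1:", missing)
```

Using any() instead of all() would answer the opposite question, namely whether at least one value of df1.col1 appears in df2.col1.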

Answer 1: (score: 0)

In pandas it is best to avoid loops with iterrows because they are slow. It is better to use the much faster vectorized pandas or numpy functions.
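
For context, here is a sketch of why the loop in the question seems to misfire. It compares every row of df1 with every row of df2, so the branch runs for each pair that differs, and `in` between two strings is a substring test rather than a column lookup. The names hits, values and missing are illustrative.

```python
import pandas as pd

# Sample data copied from the question
df1 = pd.DataFrame({"col1": list("abcd"), "col2": list("efgh")})
df2 = pd.DataFrame({"col1": list("avxw"), "col2": list("yuzt")})

# Reproduce the nested loop, collecting pairs instead of printing "blu"
hits = [
    (i.col1, j.col1)
    for _, i in df1.iterrows()
    for _, j in df2.iterrows()
    if i.col1 not in j.col1  # str-in-str: substring test on one pair
]
print(len(hits))  # the branch fires once per differing pair

# If a loop is kept, test each value against the whole column instead
values = set(df2["col1"])
missing = [v for v in df1["col1"] if v not in values]
print(missing)
```

The vectorized isin call does the same membership test without the explicit loop.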

If you need to check that a value does not exist in the column, use isin together with ~ to invert the boolean mask:

mask = ~df1.col1.isin(df2.col1)
print (mask)

0    False
1     True
2     True
3     True
Name: col1, dtype: bool

An alternative solution is numpy.in1d:

import numpy as np

mask = ~np.in1d(df1.col1, df2.col1)
print (mask)
[False  True  True  True]
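
As far as I know, recent NumPy releases recommend numpy.isin over numpy.in1d for new code, so the same mask can be sketched as follows. The frames are rebuilt from the question's sample data, and to_numpy() assumes a reasonably recent pandas version.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df1 = pd.DataFrame({"col1": list("abcd"), "col2": list("efgh")})
df2 = pd.DataFrame({"col1": list("avxw"), "col2": list("yuzt")})

# np.isin returns a plain boolean array; ~ inverts it
mask = ~np.isin(df1["col1"].to_numpy(), df2["col1"].to_numpy())
print(mask)
```

The result is a NumPy array without an index, so it lines up with df1 by position only.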

If you need to compare row by row, use != or ne:

mask = df1.col1 != df2.col1
#same as 
#mask = df1.col1.ne(df2.col1)
print (mask)

0    False
1     True
2     True
3     True
Name: col1, dtype: bool

Or:

mask = df1.col1.values != df2.col1.values
print (mask)
[False  True  True  True]
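
One caveat with the row-wise variants, shown as a small sketch: the != operator on two Series expects identical index labels, while comparing the underlying arrays works purely by position. The Series s2 and its index are hypothetical, chosen only to trigger the mismatch.

```python
import pandas as pd

s1 = pd.Series(list("abcd"))
# Hypothetical index that does not match s1's default 0..3 labels
s2 = pd.Series(list("avxw"), index=[10, 11, 12, 13])

raised = False
try:
    s1 != s2  # the operator refuses Series with different labels
except ValueError as exc:
    raised = True
    print("comparison failed:", exc)

# Comparing the arrays ignores labels and goes position by position
mask = s1.to_numpy() != s2.to_numpy()
print(mask)
```

Both frames need the same number of rows for the positional comparison to make sense.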

If you want a new column based on the mask, use numpy.where:

df1['new'] = np.where(mask, 'a', 'b')
print (df1)
  col1 col2 new
0    a    e   b
1    b    f   a
2    c    g   a
3    d    h   a
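
A self-contained sketch of the same step, assuming the membership mask from earlier in this answer. It also shows the mask used directly to select rows; the name only_in_df1 is illustrative.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df1 = pd.DataFrame({"col1": list("abcd"), "col2": list("efgh")})
df2 = pd.DataFrame({"col1": list("avxw"), "col2": list("yuzt")})

# True where the df1 value does not occur anywhere in df2.col1
mask = ~df1["col1"].isin(df2["col1"])

# Label each row: 'a' where the mask is True, 'b' otherwise
df1["new"] = np.where(mask, "a", "b")
print(df1)

# The same mask can filter rows instead of labelling them
only_in_df1 = df1[mask]
print(only_in_df1)
```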

The difference between the two checks is clearer with slightly different DataFrames:

print (df1)
  col1 col2
0    a    e
1    b    f
2    c    g
3    d    h

print (df2)
  col1 col2
0    a    y
1    v    u
2    d    z <- change value to d
3    w    t


mask = df1.col1 != df2.col1
print (mask)
0    False
1     True
2     True
3     True
Name: col1, dtype: bool

mask = ~df1.col1.isin(df2.col1)
print (mask)
0    False
1     True
2     True
3    False
Name: col1, dtype: bool

The numpy solution is faster on this small sample, roughly 4 to 5 times (42.5 µs versus 198 µs per loop):

In [23]: %timeit (~df1.col1.isin(df2.col1))
The slowest run took 7.98 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 198 µs per loop

In [24]: %timeit (~np.in1d(df1.col1,df2.col1))
The slowest run took 9.25 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 42.5 µs per loop
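
These timings come from 4-row frames, so they may not carry over to larger columns. A rough way to measure on bigger inputs outside IPython is sketched below. The sizes, the seed, and the names left and right are arbitrary assumptions, and the numbers will vary by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical larger inputs; sizes and value range are arbitrary
rng = np.random.default_rng(0)
left = pd.Series(rng.integers(0, 1_000_000, size=100_000))
right = pd.Series(rng.integers(0, 1_000_000, size=100_000))

# Total seconds for 20 runs of each approach
t_pandas = timeit.timeit(lambda: ~left.isin(right), number=20)
t_numpy = timeit.timeit(
    lambda: ~np.isin(left.to_numpy(), right.to_numpy()), number=20
)

print(f"pandas isin: {t_pandas:.4f} s for 20 runs")
print(f"numpy isin:  {t_numpy:.4f} s for 20 runs")
```

Whichever variant wins on your data, both should produce the same mask.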