Question

I want to check, inside an "if" condition, whether a value from a row of one dataframe is not contained in a specific column of another dataframe.

Suppose my dataframes are:

df1:
   col1 col2
0   a   e
1   b   f
2   c   g
3   d   h

df2:

   col1 col2
0   a   y
1   v   u
2   x   z
3   w   t

I want to iterate over every row of col1 in df1 and check whether that value is contained in col1 of df2.

My current code is:

 for row, i in df1.iterrows():
    for row, j in df2.iterrows():
       if i.col1 not in j.col1:
          print("blu")

Right now the code enters the if condition even when the value in col1 of df1 is contained in col1 of df2.

Any help would be greatly appreciated.

Answer 1

Use isin:

df1.col1.isin(df2.col1)

0     True
1    False
2    False
3    False
Name: col1, dtype: bool
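
Since the question asks for an "if" condition, here is a minimal sketch of how the isin result might be used in one. The frames are rebuilt from the question; the names found and missing are only illustrative. A boolean Series cannot be used directly in an if statement, so it is reduced with all() first.

```python
import pandas as pd

df1 = pd.DataFrame({"col1": ["a", "b", "c", "d"], "col2": ["e", "f", "g", "h"]})
df2 = pd.DataFrame({"col1": ["a", "v", "x", "w"], "col2": ["y", "u", "z", "t"]})

# True where the df1 value appears anywhere in df2.col1
found = df1["col1"].isin(df2["col1"])

# Reduce the Series to a single boolean before using it in `if`
if not found.all():
    # Collect the df1 values that have no match in df2.col1
    missing = df1.loc[~found, "col1"].tolist()
    print("not in df2.col1:", missing)
```

Use found.any() instead of found.all() if the branch should depend on whether at least one value matches.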

Answer 2

In pandas it is best to avoid loops with iterrows, because they are slow. It is better to use vectorized pandas or numpy functions, which are much faster.
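
As a side note on why the loop in the question prints even for matching values: it compares each df1 value against every df2 row separately, and `not in` between two strings is a substring test, so "a" is reported as missing as soon as it meets "v". If an explicit loop is still wanted, a sketch like the following tests each value against the whole column once (the names known and not_found are assumptions for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({"col1": ["a", "b", "c", "d"], "col2": ["e", "f", "g", "h"]})
df2 = pd.DataFrame({"col1": ["a", "v", "x", "w"], "col2": ["y", "u", "z", "t"]})

# Build the lookup once; membership in a set compares whole values
known = set(df2["col1"])

not_found = []
for value in df1["col1"]:
    # One test per df1 value against the entire df2 column
    if value not in known:
        not_found.append(value)

print(not_found)
```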

If you need to check that a value does not exist in the column, use isin together with ~ to invert the boolean mask:

mask = ~df1.col1.isin(df2.col1)
print (mask)

0    False
1     True
2     True
3     True
Name: col1, dtype: bool

An alternative solution is to use numpy.in1d:

mask = ~np.in1d(df1.col1,df2.col1)
print (mask)
[False  True  True  True]
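
On recent NumPy versions, numpy.isin is the recommended replacement for numpy.in1d (which, as far as I know, is deprecated since NumPy 2.0). A minimal sketch of the same mask with numpy.isin, using the frames from the question:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"col1": ["a", "b", "c", "d"], "col2": ["e", "f", "g", "h"]})
df2 = pd.DataFrame({"col1": ["a", "v", "x", "w"], "col2": ["y", "u", "z", "t"]})

# np.isin returns a plain boolean array; ~ inverts it
mask = ~np.isin(df1["col1"], df2["col1"])
print(mask)
```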

If you need to compare row by row instead, use != or ne:

mask = df1.col1 != df2.col1
#same as 
#mask = df1.col1.ne(df2.col1)
print (mask)

0    False
1     True
2     True
3     True
Name: col1, dtype: bool

Or:

mask = df1.col1.values != df2.col1.values
print (mask)
[False  True  True  True]

If you want to create a new column from the mask, use numpy.where:

df1['new'] = np.where(mask, 'a', 'b')
print (df1)
  col1 col2 new
0    a    e   b
1    b    f   a
2    c    g   a
3    d    h   a
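
For a version that runs on its own, the sketch below rebuilds the frames and derives the new column from the membership mask. The labels "missing" and "present" are placeholders chosen here for readability; they are not part of the original answer.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"col1": ["a", "b", "c", "d"], "col2": ["e", "f", "g", "h"]})
df2 = pd.DataFrame({"col1": ["a", "v", "x", "w"], "col2": ["y", "u", "z", "t"]})

# True where the df1 value does NOT occur in df2.col1
mask = ~df1["col1"].isin(df2["col1"])

# First label where mask is True, second label where it is False
df1["new"] = np.where(mask, "missing", "present")
print(df1)
```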

The difference between the two approaches is easier to see with slightly different DataFrames:

print (df1)
  col1 col2
0    a    e
1    b    f
2    c    g
3    d    h

print (df2)
  col1 col2
0    a    y
1    v    u
2    d    z <- change value to d
3    w    t


mask = df1.col1 != df2.col1
print (mask)
0    False
1     True
2     True
3     True
Name: col1, dtype: bool

mask = ~df1.col1.isin(df2.col1)
print (mask)
0    False
1     True
2     True
3    False
Name: col1, dtype: bool
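
The comparison above can be reproduced with a short self-contained sketch. It assumes df2 with the value d at index 2, as in the modified example; the names rowwise, membership and disagree are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame({"col1": ["a", "b", "c", "d"], "col2": ["e", "f", "g", "h"]})
df2 = pd.DataFrame({"col1": ["a", "v", "d", "w"], "col2": ["y", "u", "z", "t"]})

# Row-wise: compares df1.col1[i] with df2.col1[i] only
rowwise = df1["col1"] != df2["col1"]

# Membership: checks each df1 value against every value in df2.col1
membership = ~df1["col1"].isin(df2["col1"])

# Values on which the two checks give different answers
disagree = df1.loc[rowwise != membership, "col1"].tolist()
print(disagree)
```

The value d differs from the df2 value in the same row, but it does occur elsewhere in df2.col1, so only the membership check treats it as present.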

On this small example the numpy solution is clearly faster:

In [23]: %timeit (~df1.col1.isin(df2.col1))
The slowest run took 7.98 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 198 µs per loop

In [24]: %timeit (~np.in1d(df1.col1,df2.col1))
The slowest run took 9.25 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 42.5 µs per loop
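
The timings above come from four-row frames, so they may not carry over to larger data. A rough sketch for repeating the measurement outside IPython is shown below; the array size, the random integer data and the use of numpy.isin in place of numpy.in1d are all assumptions made here, and the results will depend on the machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical test data: two Series of 100,000 random integers
rng = np.random.default_rng(0)
left = pd.Series(rng.integers(0, 1_000_000, size=100_000))
right = pd.Series(rng.integers(0, 1_000_000, size=100_000))

runs = 10
# Total time for `runs` calls of each variant
t_pandas = timeit.timeit(lambda: ~left.isin(right), number=runs)
t_numpy = timeit.timeit(lambda: ~np.isin(left, right), number=runs)

print(f"pandas isin: {t_pandas / runs:.6f} s per call")
print(f"numpy isin:  {t_numpy / runs:.6f} s per call")
```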

Check whether a value in a row of one dataframe is not contained in a column of another dataframe
