Question

While working on an answer to another question, I stumbled upon some unexpected behavior:

Consider the following DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A':list('AAcdef'),
    'B':[4,5,4,5,5,4],
    'E':[5,3,6,9,2,4],
    'F':list('BaaBbA')
})
print(df)

   A  B  E  F
0  A  4  5  B  # <-- row contains 'A' and 5
1  A  5  3  a  # <-- row contains 'A' and 5
2  c  4  6  a
3  d  5  9  B
4  e  5  2  b
5  f  4  4  A

If we try to find all rows that contain ['A', 5], we can use jezrael's answer:

cond = [['A'],[5]]
# For each condition, flag the rows holding any of its values, then AND the flags.
print( np.logical_and.reduce([df.isin(x).any(axis=1) for x in cond]) )

which (correctly) yields: [ True  True False False False False]

However, if we use:

cond = [['A'],[5]]
print( df.apply(lambda x: np.isin([cond],[x]).all(),axis=1) )

this yields:

0    False
1    False
2    False
3    False
4    False
5    False
dtype: bool

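A closer look at the second attempt reveals:

np.isin(['A',5],df.loc[0]) "wrongly" yields array([ True, False]), probably because numpy infers the dtype <U1, so that 5 != '5'.
np.isin(['A',5],['A',4,5,'B']) "correctly" yields array([ True,  True]), which means we can (and should) use df.loc[0].values.tolist() in the .apply() method.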
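
To make the difference between the two calls visible, here is a minimal sketch that inspects the dtypes involved. The names needles, row_as_series and row_as_list are mine and purely illustrative; the exact string width numpy picks may differ from <U1, so only the dtype kind is printed.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': list('AAcdef'),
    'B': [4, 5, 4, 5, 5, 4],
    'E': [5, 3, 6, 9, 2, 4],
    'F': list('BaaBbA'),
})

# A mixed Python list becomes a NumPy string array, so 5 is stored as '5'.
needles = np.array(['A', 5])

# The row as a Series converts to an object array: 5 stays an integer.
row_as_series = np.asarray(df.loc[0])

# The row as a plain list goes through the same string coercion as needles.
row_as_list = np.array(df.loc[0].values.tolist())

print(needles.dtype.kind, row_as_series.dtype, row_as_list.dtype.kind)

via_series = np.isin(needles, row_as_series)  # '5' is compared with the int 5
via_list = np.isin(needles, row_as_list)      # '5' is compared with '5'
print(via_series, via_list)
```

If this reading is right, the mismatch is between a string array on one side and an object array on the other, not between the two rows themselves.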

The question, simplified:

Why do I need to specify x.values.tolist() in one case, while I can use x directly in the other?

Edit:

It gets even worse if we search for [4, 5]:

cond = [[4],[5]]
print( np.logical_and.reduce([df.isin(x).any(axis=1) for x in cond]) )
print( df.apply(lambda x: np.isin([cond],x.values.tolist()).all(),axis=1 ) )

Answer 1

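I think the DataFrame mixes string columns with integer columns, so looping over the rows yields a Series of mixed types, which numpy coerces to strings.

A possible solution is to convert the cond values to an array and then cast them to strings: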

cond = [[4],[5]]

print(df.apply(lambda x: np.isin(np.array(cond).astype(str), x.values.tolist()).all(),axis=1))
0     True
1    False
2    False
3    False
4    False
5    False
dtype: bool

Unfortunately, a general solution (where a row may contain only numeric columns) needs to convert both cond and the Series:

f = lambda x: np.isin(np.array(cond).astype(str), x.astype(str).tolist()).all()
print (df.apply(f, axis=1))

Or convert all of the data:

f = lambda x: np.isin(np.array(cond).astype(str), x.tolist()).all()
print (df.astype(str).apply(f, axis=1))
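
The two variants above can be wrapped in a small helper. This is only a sketch: the name rows_containing_all is hypothetical, and it assumes that comparing string representations is acceptable, so that 5 and '5' count as the same value.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': list('AAcdef'),
    'B': [4, 5, 4, 5, 5, 4],
    'E': [5, 3, 6, 9, 2, 4],
    'F': list('BaaBbA'),
})

def rows_containing_all(frame, values):
    """Hypothetical helper: flag rows that contain every value, compared as text."""
    wanted = np.array(values).astype(str)  # e.g. ['A', 5] -> ['A', '5']
    as_text = frame.astype(str)            # every cell becomes a string
    return as_text.apply(lambda row: np.isin(wanted, row.tolist()).all(), axis=1)

mixed = rows_containing_all(df, ['A', 5])
numeric = rows_containing_all(df, [4, 5])
print(mixed.tolist())
print(numeric.tolist())
```

Casting both sides to strings removes the dtype mismatch, at the cost of no longer distinguishing the number 5 from the text '5'.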

Using sets in pure Python works nicely:

print(df.apply(lambda x: set([4,5]).issubset(x),axis=1) )
0     True
1    False
2    False
3    False
4    False
5    False
dtype: bool

print(df.apply(lambda x: set(['A',5]).issubset(x),axis=1) )
0     True
1     True
2    False
3    False
4    False
5    False
dtype: bool
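
As a cross-check, the set-based approach can be compared with the df.isin approach from the question. A minimal sketch, where rows_with_all is a hypothetical name; it relies on plain Python equality and hashing, so no dtype coercion takes place.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': list('AAcdef'),
    'B': [4, 5, 4, 5, 5, 4],
    'E': [5, 3, 6, 9, 2, 4],
    'F': list('BaaBbA'),
})

def rows_with_all(frame, values):
    """Hypothetical helper: flag rows whose cells include every wanted value."""
    wanted = set(values)
    # issubset iterates over the row's cells and compares them as Python objects.
    return frame.apply(lambda row: wanted.issubset(row), axis=1)

by_sets = rows_with_all(df, ['A', 5])

# Same search with the df.isin approach used in the question.
by_isin = np.logical_and.reduce([df.isin([v]).any(axis=1) for v in ['A', 5]])

print(by_sets.tolist())
print(list(by_isin))
```

Both approaches keep 5 as an integer, which is why they agree here.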

Answer 2

Because:

df.isin works on a pd.Series, whereas np.isin does not.
df.loc[0] returns a pd.Series.
To convert the pd.Series into a plain list that np.isin handles, x.values.tolist() should work fine.
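
A small sketch of these points, assuming the DataFrame from the question (needles is an illustrative name): the pandas isin method keeps mixed values as Python objects, while np.isin gives the same answer once the row is passed as a plain list.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': list('AAcdef'),
    'B': [4, 5, 4, 5, 5, 4],
    'E': [5, 3, 6, 9, 2, 4],
    'F': list('BaaBbA'),
})

row = df.loc[0]                 # a pd.Series of object dtype
needles = pd.Series(['A', 5])   # object dtype: 'A' stays a str, 5 stays an int

# pandas compares the values as Python objects, without string coercion.
pandas_result = needles.isin(row.tolist())

# NumPy coerces both lists to strings, so '5' matches '5'.
numpy_result = np.isin(['A', 5], row.values.tolist())

print(pandas_result.tolist(), numpy_result.tolist())
```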

Unexpected behavior when applying np.isin() to a pandas DataFrame

2 answers: