Using np.select to generate a conditional column based on data in multiple other columns

Time: 2019-08-09 22:04:48

Tags: python pandas numpy

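I am trying to generate a new column on an existing DataFrame. The column is built from conditional statements whose inputs are the data in several columns of that same DataFrame.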

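I have been reading about the np.select() method, since it appears to be the best way to use multiple columns as inputs to the conditions. However, when I run my code, the default value is filled in even for rows where a condition is met. Here is some example code: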

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0,2, size=(20,3)), columns = list('ABC'))

choices = ['C Highest','B Highest','A Highest']
conditions = [
        (df['C'] is True), 
        (df['C'] is False & df['B'] is True),
        (df['A'] is True & df['C']is False & df['B'] is False)]

#conditions = [
#        (df['C'] == 1), 
#        (df['C'] == 0 & df['B'] == 1),
#        (df['A'] == 1 & df['C'] == 0 & df['B'] == 0)]

df['Highest Column'] = np.select(conditions, choices, default=np.nan)

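When I run the code above, there are no errors, but the Highest Column in the DataFrame is all NaN. It is as if the code works, but none of the conditions appear to be met (even though they are true), so only the default value is filled in.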

When I switch to the commented-out conditions (and comment out the previous conditions variable), I get "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()."

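Obviously, this data is random and abstracted from my use case, but the underlying code should be nearly identical. If column C contains 1, the row should be labeled as column C in the Highest Column series. If column C is 0 but column B is 1, the highest should be column B, and so on.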
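To pin down the intended rule, here is a minimal plain-Python sketch for a single row. The helper name `highest_label` is hypothetical and used only for illustration; the goal is still a vectorized pandas / NumPy version.

```python
from typing import Optional


def highest_label(a: int, b: int, c: int) -> Optional[str]:
    """Hypothetical helper: label one row, checking C first, then B, then A."""
    if c == 1:
        return "C Highest"
    if b == 1:          # only reached when c is not 1
        return "B Highest"
    if a == 1:          # only reached when c and b are not 1
        return "A Highest"
    return None         # no column is 1, so no label applies
```

For example, `highest_label(1, 1, 0)` gives "B Highest" because C is checked first and is 0, and B is checked before A.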

I know I could do this very quickly in Excel, but I would rather learn how to do it in Python / pandas, so any advice would be greatly appreciated!

1 answer:

Answer 0 (score: 1)

Try comparing with == and wrapping each comparison in parentheses, since & binds more tightly than ==:

choices = ['C Highest','B Highest','A Highest']
conditions = [
       (df['C'] == 1), 
       ((df['C'] == 0) & (df['B'] == 1)),
       ((df['A'] == 1) & (df['C'] == 0) & (df['B'] == 0))]

df['Highest Column'] = np.select(conditions, choices, default=np.nan)
# df.head()

    A   B   C   Highest Column
0   1   0   0   A Highest
1   0   0   1   C Highest
2   1   1   0   B Highest
3   1   0   1   C Highest
4   1   1   0   B Highest
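As a rough sketch of why the two original attempts failed: `is` tests object identity, so `df['C'] is True` is a single Python `False` rather than a per-row mask, and without parentheses `&` is evaluated before `==`, which turns the expression into a chained comparison that asks for the truth value of a whole Series. The small fixed frame below is illustrative data, not the random data from the question, and the string default `"None Highest"` is an assumption chosen here so that the choices and the default share one dtype.

```python
import numpy as np
import pandas as pd

# Illustrative data (assumption): one row per interesting case.
df = pd.DataFrame({"A": [1, 0, 1, 0], "B": [0, 0, 1, 0], "C": [0, 1, 0, 0]})

# `is` checks identity: a Series object is never the object True,
# so this is one plain bool, regardless of what the column holds.
identity_check = df["C"] is True

# `==` compares element-wise and returns a boolean Series (a real mask).
mask = df["C"] == 1

# Without parentheses this parses as df["C"] == (0 & df["B"]) == 1,
# a chained comparison that calls bool() on a Series and raises.
try:
    df["C"] == 0 & df["B"] == 1
    raised = False
except ValueError:
    raised = True

# Parenthesized, element-wise conditions, checked in priority order.
conditions = [
    df["C"] == 1,
    (df["C"] == 0) & (df["B"] == 1),
    (df["A"] == 1) & (df["C"] == 0) & (df["B"] == 0),
]
choices = ["C Highest", "B Highest", "A Highest"]

# Assumption: a string default instead of np.nan, to keep a single dtype.
df["Highest Column"] = np.select(conditions, choices, default="None Highest")
print(df)
```

Because np.select takes the first condition that matches for each row, the explicit `df['C'] == 0` checks in the later conditions are redundant here, though they do no harm and make the intent explicit.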