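I am trying to add a new column to an existing DataFrame. The column is built from conditional statements whose inputs come from several columns of that same DataFrame.
I have been reading about np.select(), since it appears to be the best way to use multiple columns as inputs to the conditions. However, when I run my code, the default value is filled in even for rows where a condition is met. Here is some example code: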
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,2, size=(20,3)), columns = list('ABC'))
choices = ['C Highest','B Highest','A Highest']
conditions = [
(df['C'] is True),
(df['C'] is False & df['B'] is True),
(df['A'] is True & df['C']is False & df['B'] is False)]
#conditions = [
# (df['C'] == 1),
# (df['C'] == 0 & df['B'] == 1),
# (df['A'] == 1 & df['C'] == 0 & df['B'] == 0)]
df['Highest Column'] = np.select(conditions, choices, default=np.nan)
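When I run the code above I get no errors, but the Highest Column in the DataFrame is all NaN. The code appears to run, yet none of the conditions seems to be met (even though some are true), so only the default value is filled in.
When I switch to the commented-out conditions (and comment out the earlier conditions variable), I get "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()."
Obviously this data is random and abstracted from my use case, but the underlying code should be nearly identical. If column C contains a 1, the row should be marked as column C in the Highest Column series. If column C is 0 but column B is 1, the highest should be column B, and so on.
I know I could do this quickly in Excel, but I would rather learn how to do it in Python / pandas, so any advice would be greatly appreciated!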
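Answer 0 (score: 1)
Try wrapping each comparison in parentheses and comparing with == instead of is. The & operator binds more tightly than ==, so df['C'] == 0 & df['B'] == 1 is evaluated as a chained comparison around 0 & df['B'], which forces a truth test on a Series and raises the ValueError. Separately, df['C'] is True tests object identity, so it evaluates to a single Python False rather than a per-row mask.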
choices = ['C Highest','B Highest','A Highest']
conditions = [
(df['C'] == 1),
((df['C'] == 0) & (df['B'] == 1)),
((df['A'] == 1) & (df['C'] == 0) & (df['B'] == 0))]
df['Highest Column'] = np.select(conditions, choices, default=np.nan)
# df.head()
   A  B  C Highest Column
0  1  0  0      A Highest
1  0  0  1      C Highest
2  1  1  0      B Highest
3  1  0  1      C Highest
4  1  1  0      B Highest
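
To see why the two attempts in the question behaved as they did, here is a minimal, self-contained sketch. The four-row frame is hypothetical data chosen so the result is reproducible, and the names identity_check and precedence_error are only illustrative. The string default "None" is my substitution: mixing string choices with np.nan may fail to find a common dtype on recent NumPy releases, so adjust it to whatever placeholder suits your data.

```python
import numpy as np
import pandas as pd

# Small fixed frame (hypothetical data) instead of random integers.
df = pd.DataFrame({"A": [1, 0, 1, 0], "B": [0, 0, 1, 0], "C": [0, 1, 0, 0]})

# `is` tests object identity: a Series is never the object True,
# so this is one plain Python bool, not a row-wise mask.
identity_check = df["C"] is True

# Without parentheses, `&` is applied before `==`, turning the line into a
# chained comparison that calls bool() on a Series and raises ValueError.
try:
    df["C"] == 0 & df["B"] == 1
    precedence_error = None
except ValueError as exc:
    precedence_error = str(exc)

choices = ["C Highest", "B Highest", "A Highest"]
conditions = [
    df["C"] == 1,
    (df["C"] == 0) & (df["B"] == 1),
    (df["A"] == 1) & (df["C"] == 0) & (df["B"] == 0),
]
# Assumption: a string default keeps every output value the same type.
df["Highest Column"] = np.select(conditions, choices, default="None")
print(df)
```

Because np.select picks the first condition that matches in each row, the order of conditions already encodes the C, then B, then A priority; the extra == 0 checks make that priority explicit rather than being strictly required.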