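I have a DataFrame with several columns. I want to assign a priority to each row, based on the data in the other columns.
I have defined the priority function: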
def priority(Bcat,Brand,IPC,Customer, Type):
    p=1
    if Bcat != "*":
        p+= len(Bcat)/3
    if Brand != "*":
        p+= 2
    if IPC != "*":
        p+= 4
    if Customer != "*" & Customer != "REPLCUST":
        p+= 8
    if Type == "Default":
        p+= -16
    return p
I now want to apply it to my DataFrame.
This is my DataFrame (2500 rows):
Bcat Brand Customer IPC LOC MKT_BUD Type STARTEFF Value
A B C D E F 1 2001-01-01 1.0
I tried this, but it does not work:
df["Priority"] = df[["Bcat","Brand","IPC","Customer","Type"]].apply(priority,axis=1,args=("Bcat","Brand","IPC","Customer","Type"))
I get this message:
TypeError: ('priority() takes 5 positional arguments but 6 were given', 'occurred at index 0')
I also tried this:
df["Priority"] = np.vectorize(priority(df.Bcat,df.Brand,df.IPC,df.Customer,df.Type))
and got this message:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Answer 0 (score: 3)
If you want to use apply on a DataFrame, you probably need a lambda function:
import pandas as pd

def priority(Bcat,Brand,IPC,Customer, Type):
    p=1
    if Bcat != "*":
        p+= len(Bcat)/3
    if Brand != "*":
        p+= 2
    if IPC != "*":
        p+= 4
    # Parentheses are required here: & binds more tightly than !=
    if (Customer != "*") & (Customer != "REPLCUST"):
        p+= 8
    if Type == "Default":
        p+= -16
    return p

df = pd.DataFrame([['A','B','C','D','E','F','1','2001-01-01','1.0']],
                  columns=['Bcat','Brand','Customer','IPC','LOC','MKT_BUD','Type','STARTEFF','Value'])
# The lambda receives one row at a time and unpacks the fields priority() needs
df.apply(lambda x: priority(x.Bcat,x.Brand,x.IPC,x.Customer,x.Type),axis = 1)
0 15.333333
dtype: float64
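For context on the two errors in the question: with axis=1, apply passes each row as the first positional argument and then appends everything in args, so priority received six arguments instead of five. np.vectorize, in turn, expects the function itself, not the result of calling it on whole Series. Below is a minimal sketch of the np.vectorize form. It assumes the question's priority function (written with `and` for the scalar comparison) and a small made-up frame that is not the asker's data.

```python
import numpy as np
import pandas as pd

def priority(Bcat, Brand, IPC, Customer, Type):
    p = 1
    if Bcat != "*":
        p += len(Bcat) / 3
    if Brand != "*":
        p += 2
    if IPC != "*":
        p += 4
    if Customer != "*" and Customer != "REPLCUST":
        p += 8
    if Type == "Default":
        p -= 16
    return p

# Illustrative rows, made up for this sketch
df = pd.DataFrame({
    "Bcat": ["*", "ABC"],
    "Brand": ["*", "B"],
    "IPC": ["*", "D"],
    "Customer": ["REPLCUST", "C"],
    "Type": ["Default", "1"],
})

# Wrap the function first, then call the wrapper with the columns.
# otypes=[float] stops NumPy from inferring an integer output type
# from the first row, which would truncate later fractional values.
vec_priority = np.vectorize(priority, otypes=[float])
df["Priority"] = vec_priority(df.Bcat, df.Brand, df.IPC, df.Customer, df.Type)
```

Note that np.vectorize still loops in Python under the hood, so it is mainly a convenience and not a real speed-up.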
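This calls the function once per row, so it may not be optimal: it iterates over the rows just to get the length of the strings in df.Bcat. I would look for something more efficient.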
Edit:
Alternatively, you can use str.len to do the work column by column:
df['priority'] = 1.0  # start as float so adding len/3 needs no dtype upcast
mask = df.Bcat != "*"
df.loc[mask,'priority'] += df.loc[mask,'Bcat'].str.len()/3
df.loc[df.Brand != "*",'priority'] += 2
df.loc[df.IPC != "*",'priority'] += 4
df.loc[~df.Customer.isin(['*','REPLCUST']),'priority'] += 8
df.loc[df.Type == "Default",'priority'] -= 16
Bcat Brand Customer IPC LOC MKT_BUD Type STARTEFF Value priority
0 A B C D E F 1 2001-01-01 1.0 15.333333
Because this operates on whole Series instead of iterating over rows, it will be faster.
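One way to gain confidence in the column-wise version is to compare it with the row-wise apply on a few rows. The sketch below does that. The sample frame is made up, and the helper name priority_columnwise is hypothetical, introduced only for this comparison.

```python
import pandas as pd

def priority(Bcat, Brand, IPC, Customer, Type):
    p = 1
    if Bcat != "*":
        p += len(Bcat) / 3
    if Brand != "*":
        p += 2
    if IPC != "*":
        p += 4
    if Customer != "*" and Customer != "REPLCUST":
        p += 8
    if Type == "Default":
        p -= 16
    return p

def priority_columnwise(df):
    # Hypothetical helper: same rules, applied to whole columns at once
    out = pd.Series(1.0, index=df.index)
    mask = df["Bcat"] != "*"
    out.loc[mask] += df.loc[mask, "Bcat"].str.len() / 3
    out.loc[df["Brand"] != "*"] += 2
    out.loc[df["IPC"] != "*"] += 4
    out.loc[~df["Customer"].isin(["*", "REPLCUST"])] += 8
    out.loc[df["Type"] == "Default"] -= 16
    return out

# Made-up rows that exercise every branch
sample = pd.DataFrame({
    "Bcat": ["*", "ABC", "AB"],
    "Brand": ["*", "B", "*"],
    "IPC": ["*", "D", "D"],
    "Customer": ["REPLCUST", "C", "*"],
    "Type": ["Default", "1", "Default"],
})

row_wise = sample.apply(
    lambda r: priority(r.Bcat, r.Brand, r.IPC, r.Customer, r.Type), axis=1
).astype(float)
col_wise = priority_columnwise(sample)

# Raises AssertionError if the two approaches disagree
pd.testing.assert_series_equal(row_wise, col_wise, check_names=False)
```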
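Answer 1 (score: 3)
This is a vectorized solution that is applied to all rows at once. It should be much faster than applying the function to each row individually.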
def priority(df):
    # assign() returns a new DataFrame, so the caller's df is left unchanged;
    # start from a float so adding len/3 needs no dtype upcast
    df = df.assign(priority=1.0)
    df['Type'] = df['Type'].astype(str)
    mask = df['Bcat'] != '*'
    df.loc[mask, 'priority'] += df.loc[mask, 'Bcat'].apply(len) / 3.
    df.loc[df['Brand'] != '*', 'priority'] += 2
    df.loc[df['IPC'] != '*', 'priority'] += 4
    df.loc[~df['Customer'].isin(['*', 'REPLCUST']), 'priority'] += 8
    df.loc[df['Type'] == 'Default', 'priority'] -= 16
    return df
>>> priority(df)
Bcat Brand Customer IPC LOC MKT_BUD Type STARTEFF Value priority
0 A B C D E F 1 2001-01-01 1 15.333333
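The same rules can also be written as a single expression, since a boolean Series multiplied by a number contributes that number where the condition is true and 0 elsewhere. This is only a sketch of an alternative formulation, not part of the answer above, and the sample frame is made up.

```python
import pandas as pd

# Made-up rows for illustration
df = pd.DataFrame({
    "Bcat": ["*", "ABC", "AB"],
    "Brand": ["*", "B", "*"],
    "IPC": ["*", "D", "D"],
    "Customer": ["REPLCUST", "C", "*"],
    "Type": ["Default", "1", "Default"],
})

# len(Bcat)/3 where Bcat is not "*", otherwise 0
bcat_term = df["Bcat"].str.len().div(3).where(df["Bcat"] != "*", 0.0)

df["priority"] = (
    1
    + bcat_term
    + 2 * (df["Brand"] != "*")
    + 4 * (df["IPC"] != "*")
    + 8 * ~df["Customer"].isin(["*", "REPLCUST"])
    - 16 * (df["Type"].astype(str) == "Default")
)
```

Whether this reads better than the step-by-step masks is a matter of taste; the masked version is easier to extend with new rules.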
Answer 2 (score: 2)
As you mentioned, apply can do the trick here.
I created this test:
import pandas as pd

df = pd.DataFrame([[1,2,3], [6,7,8]], columns=[1,2,3])
def func(a, b, c):
    return a + b + c
# axis='columns' hands each row to the lambda
df['total'] = df.apply(lambda row: func(row[1], row[2], row[3]), axis='columns')
Output:
1 2 3 total
0 1 2 3 6
1 6 7 8 21
My fix for your apply code would be:
df = pd.DataFrame([['A','B','C','D','E','F','1','2001-01-01','1.0']],
                  columns=['Bcat','Brand','Customer','IPC','LOC','MKT_BUD','Type','STARTEFF','Value'])
df['Priority'] = df.apply(lambda row: priority(row['Bcat'],
                                               row['Brand'],
                                               row['IPC'],
                                               row['Customer'],
                                               row['Type']),
                          axis='columns')
Output:
Bcat Brand Customer IPC LOC MKT_BUD Type STARTEFF Value Priority
0 A B C D E F 1 2001-01-01 1.0 15.333333