Pandas: create a column based on multiple other columns. apply() fails

Time: 2017-08-23 15:17:29

Tags: python pandas dataframe apply

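I have a dataframe with multiple columns. I want to assign a priority to each row. The priority is based on the data in the other columns.

I have defined the priority function: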

def priority(Bcat,Brand,IPC,Customer, Type):
    p=1
    if Bcat != "*":
        p+= len(Bcat)/3
    if Brand != "*":
        p+= 2
    if IPC != "*":
        p+= 4
    if Customer != "*" & Customer != "REPLCUST":
        p+= 8
    if Type == "Default":
        p+= -16
    return p

I now want to apply it to my dataframe.

This is my dataframe (2500 rows):

Bcat Brand Customer IPC LOC MKT_BUD Type STARTEFF   Value
A    B     C        D   E   F       1    2001-01-01 1.0

I am trying this, but it does not work:

df["Priority"] = df[["Bcat","Brand","IPC","Customer","Type"]].apply(priority,axis=1,args=("Bcat","Brand","IPC","Customer","Type"))

I get this message:

TypeError: ('priority() takes 5 positional arguments but 6 were given', 'occurred at index 0')

I also tried this:

df["Priority"] = np.vectorize(priority(df.Bcat,df.Brand,df.IPC,df.Customer,df.Type))

and got this message:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

3 answers:

Answer 0 (score: 3)

If you want to use apply on the dataframe, you probably need a lambda function that unpacks each row:

def priority(Bcat,Brand,IPC,Customer, Type):
    p=1
    if Bcat != "*":
        p+= len(Bcat)/3
    if Brand != "*":
        p+= 2
    if IPC != "*":
        p+= 4
    if (Customer != "*") & (Customer != "REPLCUST"): # Here you need brackets
        p+= 8
    if Type == "Default":
        p+= -16
    return p

import pandas as pd

df = pd.DataFrame([['A','B','C','D','E','F','1','2001-01-01','1.0']],
     columns = ['Bcat','Brand','Customer','IPC','LOC','MKT_BUD','Type','STARTEFF','Value'])

df.apply(lambda x: priority(x.Bcat,x.Brand,x.IPC,x.Customer,x.Type),axis = 1)

0    15.333333
dtype: float64

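This calls the function once per row, so it is probably not optimal: it iterates over the rows just to get the length of the string in df.Bcat. I would look for something more efficient.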
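As an aside on the TypeError in the question: with axis=1, apply passes each row (a Series) as the first positional argument and then appends everything in args. The five column-name strings are therefore passed as literal strings after the row, which makes six arguments. A minimal sketch of that behaviour, using a hypothetical helper show_args and a cut-down frame that are not part of the original code:

```python
import pandas as pd

# Cut-down frame for illustration only (the LOC, MKT_BUD, ... columns are omitted).
df = pd.DataFrame([['A', 'B', 'C', 'D', '1']],
                  columns=['Bcat', 'Brand', 'Customer', 'IPC', 'Type'])

def show_args(*received):
    # Hypothetical helper: describe whatever apply() hands to the function.
    kinds = ', '.join(type(a).__name__ for a in received)
    return f"{len(received)} args: {kinds}"

# With axis=1, apply() passes the row (a Series) first, then the items of args.
report = df.apply(show_args, axis=1,
                  args=("Bcat", "Brand", "IPC", "Customer", "Type"))
print(report.iloc[0])
```

The function receives the row plus five plain strings, not the five cell values, which is why the lambda above pulls the values out of the row itself.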

Edit

Alternatively, you can work column-wise and use str.len:

df['priority'] = 1.0  # float, so the fractional Bcat term can be added in place
mask = df.Bcat != "*"
df.loc[mask,'priority'] += df.loc[mask,'Bcat'].str.len()/3
df.loc[df.Brand != "*",'priority'] += 2
df.loc[df.IPC != "*",'priority'] += 4
df.loc[~df.Customer.isin(['*','REPLCUST']),'priority'] += 8
df.loc[df.Type == "Default",'priority'] -= 16

    Bcat    Brand   Customer    IPC LOC MKT_BUD Type  STARTEFF    Value priority
0   A       B       C           D   E   F       1     2001-01-01  1.0   15.333333

Because this operates on whole Series instead of iterating over rows, it will be faster.
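A quick way to check that the column-wise version really matches the row-wise function is to run both on a few rows and compare. The sketch below uses made-up sample rows (not the asker's data) chosen to hit every branch, and writes the Customer test with `and`, which is equivalent for plain strings. No timings are shown because they depend on the machine and the data.

```python
import pandas as pd

# Made-up sample rows, chosen so that every branch is exercised.
df = pd.DataFrame({
    'Bcat':     ['A', '*', 'ABCDEF'],
    'Brand':    ['B', '*', 'B'],
    'IPC':      ['D', '*', '*'],
    'Customer': ['C', 'REPLCUST', '*'],
    'Type':     ['1', 'Default', 'Default'],
})

def priority(Bcat, Brand, IPC, Customer, Type):
    p = 1
    if Bcat != "*":
        p += len(Bcat) / 3
    if Brand != "*":
        p += 2
    if IPC != "*":
        p += 4
    if Customer != "*" and Customer != "REPLCUST":
        p += 8
    if Type == "Default":
        p -= 16
    return p

# Row-wise: one Python-level function call per row.
row_wise = df.apply(
    lambda x: priority(x.Bcat, x.Brand, x.IPC, x.Customer, x.Type), axis=1
).astype(float)

# Column-wise: each rule is applied to a whole column through a boolean mask.
out = df.copy()
out['priority'] = 1.0
mask = out.Bcat != "*"
out.loc[mask, 'priority'] += out.loc[mask, 'Bcat'].str.len() / 3
out.loc[out.Brand != "*", 'priority'] += 2
out.loc[out.IPC != "*", 'priority'] += 4
out.loc[~out.Customer.isin(['*', 'REPLCUST']), 'priority'] += 8
out.loc[out.Type == "Default", 'priority'] -= 16

# Largest absolute difference between the two results.
max_diff = (row_wise - out['priority']).abs().max()
print(max_diff)
```

If max_diff is zero (or within floating-point noise), the two approaches agree on those rows.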

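Answer 1 (score: 3)

This is a vectorized solution that is applied to all rows at once. It should be much faster than applying the function to each row individually.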

def priority(df):
    df = df.assign(priority=1.0)  # float, so the fractional Bcat term can be added in place
    df['Type'] = df['Type'].astype(str)
    mask = df['Bcat'] != '*'
    df.loc[mask, 'priority'] += df.loc[mask, 'Bcat'].apply(len) / 3.
    df.loc[df['Brand'] != '*', 'priority'] += 2
    df.loc[df['IPC'] != '*', 'priority'] += 4
    df.loc[~df['Customer'].isin(['*', 'REPLCUST']), 'priority'] += 8
    df.loc[df['Type'] == 'Default', 'priority'] -= 16
    return df

>>> priority(df)
  Bcat Brand Customer IPC LOC MKT_BUD Type    STARTEFF  Value   priority
0    A     B        C   D   E       F    1  2001-01-01      1  15.333333
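The same rules can also be written as a single expression, since multiplying a boolean Series by a weight gives the weight where the condition holds and 0 elsewhere. This is only a sketch of an alternative formulation, not the answer's method; the name priority_expr is made up for illustration.

```python
import pandas as pd

def priority_expr(df):
    # Hypothetical alternative: each rule becomes weight * boolean mask.
    # Bcat contributes len(Bcat) / 3 unless it is the wildcard "*".
    bcat_term = df['Bcat'].str.len().where(df['Bcat'] != '*', 0) / 3
    return (1
            + bcat_term
            + 2 * (df['Brand'] != '*')
            + 4 * (df['IPC'] != '*')
            + 8 * (~df['Customer'].isin(['*', 'REPLCUST']))
            - 16 * (df['Type'].astype(str) == 'Default'))

# The sample row used in the answers above.
df = pd.DataFrame([['A', 'B', 'C', 'D', 'E', 'F', '1', '2001-01-01', '1.0']],
                  columns=['Bcat', 'Brand', 'Customer', 'IPC', 'LOC',
                           'MKT_BUD', 'Type', 'STARTEFF', 'Value'])

df['priority'] = priority_expr(df)
print(df['priority'])
```

Unlike the masked += version, this never writes into an existing column, so there is no dtype to worry about when the Bcat term is fractional.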

Answer 2 (score: 2)

As you mentioned, apply can do the trick here.

I created this test:

import pandas as pd

df = pd.DataFrame([[1,2,3], [6,7,8]], columns=[1,2,3])
def func(a, b, c):
    return a + b + c
df['total'] = df.apply(lambda row: func(row[1], row[2], row[3]), axis='columns')

Output:

    1   2   3   total
0   1   2   3   6
1   6   7   8   21

My fix for your apply code would be:

df = pd.DataFrame([['A','B','C','D','E','F','1','2001-01-01','1.0']],
     columns = ['Bcat','Brand','Customer','IPC','LOC','MKT_BUD','Type','STARTEFF','Value'])


df['Priority'] = df.apply(lambda row: priority(row['Bcat'], 
                                               row['Brand'], 
                                               row['IPC'], 
                                               row['Customer'], 
                                               row['Type']), 
                          axis='columns')

Output:

    Bcat   Brand    Customer    IPC LOC MKT_BUD Type  STARTEFF     Value    Priority
0   A      B        C           D   E   F        1    2001-01-01    1.0     15.333333
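As for the np.vectorize attempt in the question: that line calls priority with whole Series first, so the `if` statements see a Series and raise the ValueError before np.vectorize is ever involved. np.vectorize expects the function itself and returns a wrapper, which is then called with the columns. A minimal sketch, assuming the same sample row as above (vec_priority is a name chosen here for illustration, and the Customer test uses `and`):

```python
import numpy as np
import pandas as pd

def priority(Bcat, Brand, IPC, Customer, Type):
    p = 1
    if Bcat != "*":
        p += len(Bcat) / 3
    if Brand != "*":
        p += 2
    if IPC != "*":
        p += 4
    if Customer != "*" and Customer != "REPLCUST":
        p += 8
    if Type == "Default":
        p -= 16
    return p

df = pd.DataFrame([['A', 'B', 'C', 'D', 'E', 'F', '1', '2001-01-01', '1.0']],
                  columns=['Bcat', 'Brand', 'Customer', 'IPC', 'LOC',
                           'MKT_BUD', 'Type', 'STARTEFF', 'Value'])

# Wrap the function first, then call the wrapper with the columns.
# otypes=[float] fixes the output dtype instead of inferring it from the first row.
vec_priority = np.vectorize(priority, otypes=[float])
df['Priority'] = vec_priority(df.Bcat, df.Brand, df.IPC, df.Customer, df.Type)
print(df['Priority'])
```

The NumPy documentation describes np.vectorize as a convenience wrapper that is essentially a for loop, so this removes the error but should not be expected to outperform apply by much.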