How can I optimize iteration over a pandas DataFrame in Python?

Time: 2019-05-06 14:15:56

Tags: python pandas performance optimization

I have a DataFrame like this:

''' df: 
        index, sales_fraction, Selected, T_value, A_value, D_value
        1       0.33            t          0.3343   0.33434   0.33434 
        2       0.45            a          0.3434   0.23232   0.33434 
        3       0.56            d          0.3434   0.33434   0.6767
        4       0.545           t          0.3434   0.33434   0.3346
        5       0.343           d          0.2323   0.96342   0.2323
''' 

I have a function like this:

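def aggregation(df):
    # 'volume' is assumed to be a column of the full data set;
    # it is not part of the sample shown above.
    df['sales_fraction'] = df['volume'] / df['volume'].sum()
    res = 0
    for ix, row in df.iterrows():
        # Weight the value column named by 'Selected' by the row's sales fraction.
        if row['Selected'] == 't':
            res += row['sales_fraction'] * row['T_value']
        elif row['Selected'] == 'a':
            res += row['sales_fraction'] * row['A_value']
        elif row['Selected'] == 'd':
            res += row['sales_fraction'] * row['D_value']

    return res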

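It runs very slowly, because I need to call this aggregation function millions of times from another function. Are there any suggestions for optimizing my code? Any help is much appreciated. Thanks!

5 Answers:

Answer 0: (score: 1)

You can use np.select and np.sum: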

import numpy as np

# One boolean mask per possible value of 'Selected'
cond1 = df['Selected'] == 't'
cond2 = df['Selected'] == 'a'
cond3 = df['Selected'] == 'd'
# The weighted value to use when the matching mask is True
val1 = df['sales_fraction'] * df['T_value']
val2 = df['sales_fraction'] * df['A_value']
val3 = df['sales_fraction'] * df['D_value']
conditions = [cond1, cond2, cond3]
values = [val1, val2, val3]

res = np.sum(np.select(conditions, values))

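np.select takes a list of conditions and a list of values. For each row it returns the value belonging to the first condition that is true, and 0 by default where none is true. You can therefore build a conditions list and a values list and pass both to np.select. np.sum then returns the sum of all the selected values.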
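To make the np.select approach easy to try, here is a minimal self-contained sketch. It rebuilds the sample DataFrame from the question and takes sales_fraction as given, since the sample has no volume column. The function name aggregation_select is hypothetical, chosen only for this illustration.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question; sales_fraction is used as given.
df = pd.DataFrame({
    "sales_fraction": [0.33, 0.45, 0.56, 0.545, 0.343],
    "Selected": ["t", "a", "d", "t", "d"],
    "T_value": [0.3343, 0.3434, 0.3434, 0.3434, 0.2323],
    "A_value": [0.33434, 0.23232, 0.33434, 0.33434, 0.96342],
    "D_value": [0.33434, 0.33434, 0.6767, 0.3346, 0.2323],
})


def aggregation_select(df):
    """Vectorized version of the loop: pick one value per row, then weight and sum."""
    selected = df["Selected"].to_numpy()
    # One boolean mask per category, in the same order as the choices below.
    conditions = [selected == "t", selected == "a", selected == "d"]
    choices = [
        df["T_value"].to_numpy(),
        df["A_value"].to_numpy(),
        df["D_value"].to_numpy(),
    ]
    # Rows matching none of the conditions contribute 0.0.
    picked = np.select(conditions, choices, default=0.0)
    return float((df["sales_fraction"].to_numpy() * picked).sum())


result = aggregation_select(df)
print(result)
```

Selecting the raw value first and multiplying by sales_fraction once avoids building three weighted columns, which may matter when the function is called very often.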

Answer 1: (score: 1)

I am using lookup:

s=df.loc[:,'T_value':]
s.columns=s.columns.str.split('_').str[0]
np.sum(df.sales_fraction*s.lookup(s.index,df.Selected.str.upper()))
Out[1421]: 0.8606469
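Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in pandas 2.0, so the snippet above only runs on older versions. The sketch below shows one possible replacement based on pd.factorize and NumPy integer indexing. It assumes the value columns are named by upper-casing Selected and appending "_value", as in the question's sample. The function name aggregation_indexing is hypothetical.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question; sales_fraction is used as given.
df = pd.DataFrame({
    "sales_fraction": [0.33, 0.45, 0.56, 0.545, 0.343],
    "Selected": ["t", "a", "d", "t", "d"],
    "T_value": [0.3343, 0.3434, 0.3434, 0.3434, 0.2323],
    "A_value": [0.33434, 0.23232, 0.33434, 0.33434, 0.96342],
    "D_value": [0.33434, 0.33434, 0.6767, 0.3346, 0.2323],
})


def aggregation_indexing(df):
    """Row-wise column lookup without DataFrame.lookup."""
    # Column label wanted for each row, e.g. 't' -> 'T_value'.
    wanted = df["Selected"].str.upper() + "_value"
    # codes[i] is the position of row i's label within `labels`.
    codes, labels = pd.factorize(wanted)
    # Reorder the columns to match `labels`, then take one cell per row.
    values = df.reindex(labels, axis=1).to_numpy()
    picked = values[np.arange(len(df)), codes]
    return float((picked * df["sales_fraction"].to_numpy()).sum())


result = aggregation_indexing(df)
print(result)
```

If a label in Selected has no matching column, reindex produces a column of NaN, so the result becomes NaN rather than silently picking a wrong value.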

Answer 2: (score: 1)

Try pd.get_dummies():

weights = pd.get_dummies(df.Selected)[['t','a', 'd']]
selected = (df[['T_value', 'A_value', 'D_value']].values * weights.values).sum(1)
(selected * df['sales_fraction']).sum()

# 0.8606469

Answer 3: (score: 1)

This function uses lookup and sum:

def aggregation(df):  
    return sum(df.lookup(df.index, df['Selected'].str.upper() +'_value')*df['sales_fraction'])

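Answer 4: (score: 1)

If I understand your calculation correctly, I suggest trying this single expression and comparing its result with that of your function (everything is inline):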

(
    (df.loc[df["Selected"] == 't', "T_value"] * df.loc[df["Selected"] == 't', "sales_fraction"]).sum()
    + (df.loc[df["Selected"] == 'a', "A_value"] * df.loc[df["Selected"] == 'a', "sales_fraction"]).sum()
    + (df.loc[df["Selected"] == 'd', "D_value"] * df.loc[df["Selected"] == 'd', "sales_fraction"]).sum()
)
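One way to carry out that comparison is sketched below. It runs a row-by-row loop and the boolean-mask expression on the question's sample data, checks that they agree, and times both with timeit. The names aggregation_loop and aggregation_masks are hypothetical, and the loop omits the sales_fraction recomputation because the sample has no volume column. Timings depend heavily on the number of rows, so measure on data of realistic size before drawing conclusions.

```python
import timeit

import numpy as np
import pandas as pd

# Sample data copied from the question; sales_fraction is used as given.
df = pd.DataFrame({
    "sales_fraction": [0.33, 0.45, 0.56, 0.545, 0.343],
    "Selected": ["t", "a", "d", "t", "d"],
    "T_value": [0.3343, 0.3434, 0.3434, 0.3434, 0.2323],
    "A_value": [0.33434, 0.23232, 0.33434, 0.33434, 0.96342],
    "D_value": [0.33434, 0.33434, 0.6767, 0.3346, 0.2323],
})

# Which value column each category selects.
COLUMN_FOR = {"t": "T_value", "a": "A_value", "d": "D_value"}


def aggregation_loop(df):
    """Row-by-row reference, equivalent to the loop in the question."""
    res = 0.0
    for _, row in df.iterrows():
        res += row["sales_fraction"] * row[COLUMN_FOR[row["Selected"]]]
    return res


def aggregation_masks(df):
    """Boolean-mask version: one masked product and sum per category."""
    total = 0.0
    for key, col in COLUMN_FOR.items():
        mask = df["Selected"] == key
        total += (df.loc[mask, col] * df.loc[mask, "sales_fraction"]).sum()
    return float(total)


loop_result = aggregation_loop(df)
mask_result = aggregation_masks(df)
same = bool(np.isclose(loop_result, mask_result))

# Time 100 calls of each; the absolute numbers are machine dependent.
loop_time = timeit.timeit(lambda: aggregation_loop(df), number=100)
mask_time = timeit.timeit(lambda: aggregation_masks(df), number=100)
print(f"results match: {same}")
print(f"loop: {loop_time:.4f}s, masks: {mask_time:.4f}s for 100 calls")
```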