Pandas: aggregate a custom WMAPE function over multiple columns without a for loop?

Time: 2019-02-22 16:29:49

Tags: python pandas pandas-groupby forecasting pandas-apply

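Goal: group a pandas DataFrame and apply a custom WMAPE (weighted mean absolute percentage error) function to multiple forecast columns against one actual-data column, without a for loop. I know that a for loop plus a merge of the output DataFrames would solve the problem. I want to do this efficiently.

Have: a WMAPE function, and a successful use of it on one forecast column of the DataFrame. One actual-data column and a variable number of forecast columns.

Input data: a pandas DataFrame with several categorical columns (City, Person, DT, Hour), one actual-data column (Actual), and four forecast columns (Forecast_1 ... Forecast_4). See the link for the CSV: https://www.dropbox.com/s/tidf9lj80a1dtd8/data_small_2.csv?dl=1

Need: the WMAPE function applied to multiple columns during the groupby, with a list of forecast columns fed into the groupby line.

Desired output: an output DataFrame with the category group columns and all the WMAPE columns. Labels are preferred but not required (output image below).

Code that works so far: two WMAPE functions, one that takes two Series and returns a single float (wmape), and one structured for use in a groupby (wmape_gr):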

import pandas as pd

def wmape(actual, forecast):
    # take two Series and calculate a single WMAPE value from them

    # make a series called mape
    se_mape = abs(actual-forecast)/actual

    # get a float of the sum of the actual
    ft_actual_sum = actual.sum()

    # get a series of the multiple of the actual & the mape
    se_actual_prod_mape = actual * se_mape

    # summate the prod of the actual and the mape
    ft_actual_prod_mape_sum = se_actual_prod_mape.sum()

    # float: wmape of forecast
    ft_wmape_forecast = ft_actual_prod_mape_sum / ft_actual_sum

    # return a float
    return ft_wmape_forecast

def wmape_gr(df_in, st_actual, st_forecast):
    # take a DataFrame (or one groupby group) plus the names of the actual
    # and forecast columns, and calculate a single WMAPE value from them

    # make a series called mape
    se_mape = abs(df_in[st_actual] - df_in[st_forecast]) / df_in[st_actual]

    # get a float of the sum of the actual
    ft_actual_sum = df_in[st_actual].sum()

    # get a series of the multiple of the actual & the mape
    se_actual_prod_mape = df_in[st_actual] * se_mape

    # summate the prod of the actual and the mape
    ft_actual_prod_mape_sum = se_actual_prod_mape.sum()

    # float: wmape of forecast
    ft_wmape_forecast = ft_actual_prod_mape_sum / ft_actual_sum

    # return a float
    return ft_wmape_forecast

# read in data directly from Dropbox
df = pd.read_csv('https://www.dropbox.com/s/tidf9lj80a1dtd8/data_small_2.csv?dl=1',sep=",",header=0)

# grouping with 3 columns. wmape_gr uses the Actual column, and Forecast_1 as inputs
df_gr = df.groupby(['City','Person','DT']).apply(wmape_gr,'Actual','Forecast_1')

The output looks like this (first two rows):

[Image: first two rows of the output]

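The desired output would have all the forecasts in one table (dummy data for Forecast_2 ... Forecast_4). I can already do this with a for loop. I just want to do it within the groupby, calling the wmape function four times. I would appreciate any help.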

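3 Answers:

Answer 0 (score: 4)

This is a very good question for showing how to optimize groupby.apply in pandas. I use two principles to tackle problems like this.

  1. Any computation that is independent of the group should not be done inside the groupby
  2. If there is a built-in groupby method, use it first, before resorting to apply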
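
As a small illustration of the second principle, here is a sketch on made-up data (the `key` and `x` columns are hypothetical, not taken from the question's CSV). It compares a built-in groupby reduction with the equivalent apply, which calls a Python function once per group:

```python
import pandas as pd

# Toy data; the column names are invented for this illustration
toy = pd.DataFrame({"key": ["a", "a", "b"], "x": [1.0, 2.0, 3.0]})
g = toy.groupby("key")

# Generic apply: runs the lambda once per group
via_apply = g["x"].apply(lambda s: s.sum())

# Built-in reduction: same values, computed by pandas' own groupby code
via_builtin = g["x"].sum()

# Compare the two results element by element
print((via_apply == via_builtin).all())
```

On a frame with many groups the built-in form is usually the faster of the two, which is the motivation for the rewrite that follows.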

Let's go through the wmape_gr function line by line.

se_mape = abs(df_in[st_actual] - df_in[st_forecast]) / df_in[st_actual]

This line is completely independent of any group. You should do this calculation outside of apply. Below, I do it for each forecast column:

df['actual_forecast_diff_1'] = (df['Actual'] - df['Forecast_1']).abs() / df['Actual']
df['actual_forecast_diff_2'] = (df['Actual'] - df['Forecast_2']).abs() / df['Actual']
df['actual_forecast_diff_3'] = (df['Actual'] - df['Forecast_3']).abs() / df['Actual']
df['actual_forecast_diff_4'] = (df['Actual'] - df['Forecast_4']).abs() / df['Actual']

Let's look at the next line:

ft_actual_sum = df_in[st_actual].sum()

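This line depends on the group, so we must use a groupby here, but it does not have to go inside the apply function. It will be computed later.

We move on to the next line: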

se_actual_prod_mape = df_in[st_actual] * se_mape

This is again independent of the group. Let's compute it on the whole DataFrame.

df['forecast1_wampe'] = df['actual_forecast_diff_1'] *  df['Actual']
df['forecast2_wampe'] = df['actual_forecast_diff_2'] *  df['Actual']
df['forecast3_wampe'] = df['actual_forecast_diff_3'] *  df['Actual']
df['forecast4_wampe'] = df['actual_forecast_diff_4'] *  df['Actual']

Let's move on to the last two lines:

ft_actual_prod_mape_sum = se_actual_prod_mape.sum()
ft_wmape_forecast = ft_actual_prod_mape_sum / ft_actual_sum

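These lines again depend on the group, but we still don't need apply. We have now computed the four 'forecast_wampe' columns independently of the group. We just need to sum each of them per group. The same goes for the 'Actual' column.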
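
One way to see why the row-wise work can move out of the groupby is to write the function as a formula. The symbols are introduced here for illustration and do not appear in the original post: $a_i$ is the actual and $f_i$ the forecast in row $i$ of a group $G$, and every $a_i$ is assumed to be nonzero.

```latex
\mathrm{WMAPE}_G
  = \frac{\sum_{i \in G} a_i \cdot \dfrac{|a_i - f_i|}{a_i}}{\sum_{i \in G} a_i}
  = \frac{\sum_{i \in G} |a_i - f_i|}{\sum_{i \in G} a_i}
```

Only the two sums depend on the group. The terms inside them are computed per row, so they can be evaluated once on the whole DataFrame.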

We can run two separate groupby operations to sum these columns, like so:

g = df.groupby(['City', 'Person', 'DT'])
actual_sum = g['Actual'].sum()
forecast_wampe_cols = ['forecast1_wampe', 'forecast2_wampe', 'forecast3_wampe', 'forecast4_wampe']
forecast1_wampe_sum = g[forecast_wampe_cols].sum()

We get the following Series and DataFrame:

[Image: the resulting Series]

[Image: the resulting DataFrame]

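Then we just need to divide each column of the DataFrame by the Series. We need to use the div method, rather than the / operator, to change the axis of the division so that the indexes align: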
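
To see why the plain `/` operator is not enough here, consider this sketch on made-up aggregates (the names `sums` and `totals` and the labels `g1`, `g2` are hypothetical). With `/`, pandas matches the Series index against the DataFrame's column labels. With `div(..., axis='index')`, it matches the Series index against the row index instead:

```python
import pandas as pd

# Invented per-group sums and totals, indexed by group label
sums = pd.DataFrame({"f1": [4.0, 5.0], "f2": [5.0, 2.0]}, index=["g1", "g2"])
totals = pd.Series([30.0, 20.0], index=["g1", "g2"])

# Series index (g1, g2) is aligned with the columns (f1, f2): no label matches,
# so every cell of the result is NaN
wrong = sums / totals

# Series index is aligned with the row index: each row is divided by its total
right = sums.div(totals, axis="index")

print(wrong)
print(right)
```

In this answer, the grouped forecast sums play the role of `sums` and `actual_sum` plays the role of `totals`: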

forecast1_wampe_sum.div(actual_sum, axis='index')

This returns our answer:

[Image: the resulting DataFrame of WMAPE values]
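
The steps above can also be written for any number of forecast columns. The following is a minimal sketch under stated assumptions, not the answerer's code: it uses a small invented DataFrame in place of the CSV, groups by only two keys, and sums `|Actual - Forecast|` directly. That shortcut matches the columns built above only when no `Actual` value is zero. In the original formulation, a zero `Actual` makes that row's product NaN, and `sum()` then skips the row.

```python
import pandas as pd

# Toy stand-in for the question's CSV (values invented for illustration)
df = pd.DataFrame({
    "City": ["A", "A", "B", "B"],
    "Person": ["x", "x", "y", "y"],
    "Actual": [10.0, 20.0, 5.0, 15.0],
    "Forecast_1": [12.0, 18.0, 5.0, 10.0],
    "Forecast_2": [10.0, 25.0, 7.0, 15.0],
})

keys = ["City", "Person"]
forecast_cols = [c for c in df.columns if c.startswith("Forecast_")]

# Row-wise step, done once for all forecast columns: |Actual - Forecast_k|
abs_err = df[forecast_cols].sub(df["Actual"], axis=0).abs()

# Group-wise step: built-in sums only, no apply
err_sum = abs_err.groupby([df[k] for k in keys]).sum()
actual_sum = df.groupby(keys)["Actual"].sum()

# Divide each column by the per-group total, aligning on the group index
result = err_sum.div(actual_sum, axis="index").add_suffix("_wmape").reset_index()
print(result)
```

Because `forecast_cols` is built from the column names, adding a `Forecast_5` column to the input would need no change to the code.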

Answer 1 (score: 2)

If you modify wmape to use broadcasting so that it works on arrays, you can do it all in one go:

import numpy as np

def wmape(actual, forecast):
    # Take a series (actual) and a dataframe (forecast) and calculate wmape
    # for each forecast. Returns a dict with one entry per forecast column

    # Convert to numpy arrays for broadcasting
    forecast = np.array(forecast.values)
    actual=np.array(actual.values).reshape((-1, 1))

    # Make an array of mape (same shape as forecast)
    se_mape = abs(actual-forecast)/actual

    # Calculate sum of actual values
    ft_actual_sum = actual.sum(axis=0)

    # Multiply the actual values by the mape
    se_actual_prod_mape = actual * se_mape

    # Take the sum of the product of actual values and mape
    # Make sure to sum down the rows (1 for each column)
    ft_actual_prod_mape_sum = se_actual_prod_mape.sum(axis=0)

    # Calculate the wmape for each forecast and return as a dictionary
    ft_wmape_forecast = ft_actual_prod_mape_sum / ft_actual_sum
    return {f'Forecast_{i+1}_wmape': wmape for i, wmape in enumerate(ft_wmape_forecast)}
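
The reshape of `actual` into a column vector is what makes the broadcasting work. As a minimal sketch with invented numbers (two rows, two forecasts), subtracting an `(n, k)` array from an `(n, 1)` array compares the single actual column against every forecast column at once:

```python
import numpy as np

# Invented values: 2 rows, 2 forecast columns
actual = np.array([10.0, 20.0]).reshape((-1, 1))    # shape (2, 1)
forecast = np.array([[12.0, 10.0], [18.0, 25.0]])   # shape (2, 2)

# (2, 1) - (2, 2) broadcasts the actual column across both forecast columns
abs_err = np.abs(actual - forecast)                  # shape (2, 2)

# Sum down the rows, giving one value per forecast column
wmape_per_forecast = abs_err.sum(axis=0) / actual.sum(axis=0)
print(abs_err.shape, wmape_per_forecast)
```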

Then use apply on the appropriate columns:

# Group the dataframe and apply the function to appropriate columns
new_df = df.groupby(['City', 'Person', 'DT']).apply(lambda x: wmape(x['Actual'], 
                                        x[[c for c in x if 'Forecast' in c]])).\
            to_frame().reset_index()

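This results in a DataFrame with a single column of dictionaries. [Image: intermediate results]

The single column can be converted into multiple columns in the correct format: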

# Convert the dictionary in a single column into 4 columns with proper names
# and concatenate column-wise
df_grp = pd.concat([new_df.drop(columns=[0]), 
                    pd.DataFrame(list(new_df[0].values))], axis=1)

Result:

[Image: result of operations]
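
A possible variation, not part of this answer, is to return a pandas Series from the applied function instead of a dictionary. When every group returns a Series with the same labels, `groupby(...).apply` assembles them into one column per label, so no dictionary unpacking step is needed. The helper name `wmape_all` and the toy data below are hypothetical, and the sketch sums `|Actual - Forecast|` directly, which assumes no `Actual` value is zero.

```python
import pandas as pd

# Toy stand-in for the question's CSV (values invented for illustration)
df = pd.DataFrame({
    "City": ["A", "A", "B", "B"],
    "Person": ["x", "x", "y", "y"],
    "Actual": [10.0, 20.0, 5.0, 15.0],
    "Forecast_1": [12.0, 18.0, 5.0, 10.0],
    "Forecast_2": [10.0, 25.0, 7.0, 15.0],
})
keys = ["City", "Person"]
forecast_cols = ["Forecast_1", "Forecast_2"]

def wmape_all(group):
    # Hypothetical helper: one WMAPE per forecast column, returned as a Series
    actual = group["Actual"]
    abs_err = group[forecast_cols].sub(actual, axis=0).abs()
    return (abs_err.sum() / actual.sum()).add_suffix("_wmape")

# Select only the value columns so the function never sees the group keys
result = df.groupby(keys)[["Actual"] + forecast_cols].apply(wmape_all).reset_index()
print(result)
```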

Answer 2 (score: 1)

Without changing the function

Apply it four times

df_gr1 = df.groupby(['City','Person','DT']).apply(wmape_gr,'Actual','Forecast_1')
df_gr2 = df.groupby(['City','Person','DT']).apply(wmape_gr,'Actual','Forecast_2')
df_gr3 = df.groupby(['City','Person','DT']).apply(wmape_gr,'Actual','Forecast_3')
df_gr4 = df.groupby(['City','Person','DT']).apply(wmape_gr,'Actual','Forecast_4')

Join them together

all1= pd.concat([df_gr1, df_gr2,df_gr3,df_gr4],axis=1, sort=False)

Get the columns for City, Person and DT

all1['city']= [all1.index[i][0]  for i in range(len(df_gr1))]
all1['Person']= [all1.index[i][1]  for i in range(len(df_gr1))]
all1['DT']= [all1.index[i][2]  for i in range(len(df_gr1))]

Rename the columns and change the order

df = all1.rename(columns={0:'Forecast_1_wmape', 1:'Forecast_2_wmape',2:'Forecast_3_wmape',3:'Forecast_4_wmape'})

df = df[['city','Person','DT','Forecast_1_wmape','Forecast_2_wmape','Forecast_3_wmape','Forecast_4_wmape']]

df=df.reset_index(drop=True)
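
The same result can probably be reached with less bookkeeping. This is a sketch under stated assumptions: it uses invented toy data, and a condensed copy of the question's `wmape_gr`. It still calls the function once per forecast column, so it tidies this approach rather than removing the loop. Passing a dict to `pd.concat` names the columns directly, and `reset_index()` turns the group keys into ordinary columns, which replaces the three list comprehensions and the rename.

```python
import pandas as pd

def wmape_gr(df_in, st_actual, st_forecast):
    # Condensed version of the question's function (same arithmetic)
    se_mape = (df_in[st_actual] - df_in[st_forecast]).abs() / df_in[st_actual]
    return (df_in[st_actual] * se_mape).sum() / df_in[st_actual].sum()

# Toy stand-in for the question's CSV (values invented for illustration)
df = pd.DataFrame({
    "City": ["A", "A", "B", "B"],
    "Person": ["x", "x", "y", "y"],
    "DT": ["2019-01-01"] * 4,
    "Actual": [10.0, 20.0, 5.0, 15.0],
    "Forecast_1": [12.0, 18.0, 5.0, 10.0],
    "Forecast_2": [10.0, 25.0, 7.0, 15.0],
})
keys = ["City", "Person", "DT"]
forecast_cols = ["Forecast_1", "Forecast_2"]

# Group once, keeping only the value columns the function needs
grouped = df.groupby(keys)[["Actual"] + forecast_cols]

# The dict keys become the column names of the combined frame
df_out = pd.concat(
    {f"{col}_wmape": grouped.apply(wmape_gr, "Actual", col) for col in forecast_cols},
    axis=1,
).reset_index()
print(df_out)
```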