Calculating differences from the previous year / forecast in a pandas DataFrame

Time: 2013-08-27 18:22:14

Tags: python pandas dataframe

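I'm looking to compare the output of multiple model runs, calculating these values:

  1. The difference between current-period revenue and the previous period's revenue
  2. The difference between actual current-period revenue and forecast current-period revenue

I've experimented with multi-indexes and suspect the answer lies in that direction, with some creative use of shift(). However, I'm afraid I've mangled the problem through indiscriminate pivot/melt/groupby experiments. Perhaps you can help me figure out how to turn this: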

    import pandas as pd
    
    ids = [1,2,3] * 5
    year = ['2013', '2013', '2013', '2014', '2014', '2014', '2014', '2014', '2014', '2015', '2015', '2015', '2015', '2015', '2015']
    run = ['actual','actual','actual','forecast','forecast','forecast','actual','actual','actual','forecast','forecast','forecast','actual','actual','actual']
    
    revenue = [10,20,20,30,50,90,10,40,50,120,210,150,130,100,190]
    
    change_from_previous_year = ['NA','NA','NA',20,30,70,0,20,30,90,160,60,120,60,140]
    change_from_forecast = ['NA','NA','NA','NA','NA','NA',-20,-10,-40,'NA','NA','NA',30,-110,40]
    
    d = {'ids':ids, 'year':year, 'run':run, 'revenue':revenue}
    
    df = pd.DataFrame(data=d, columns=['ids','year','run','revenue'])
    print(df)
    
        ids  year       run  revenue
    0     1  2013    actual       10
    1     2  2013    actual       20
    2     3  2013    actual       20
    3     1  2014  forecast       30
    4     2  2014  forecast       50
    5     3  2014  forecast       90
    6     1  2014    actual       10
    7     2  2014    actual       40
    8     3  2014    actual       50
    9     1  2015  forecast      120
    10    2  2015  forecast      210
    11    3  2015  forecast      150
    12    1  2015    actual      130
    13    2  2015    actual      100
    14    3  2015    actual      190
    

...into this:

        ids  year       run  revenue chg_from_prev_year chg_from_forecast
    0     1  2013    actual       10                 NA                NA
    1     2  2013    actual       20                 NA                NA
    2     3  2013    actual       20                 NA                NA
    3     1  2014  forecast       30                 20                NA
    4     2  2014  forecast       50                 30                NA
    5     3  2014  forecast       90                 70                NA
    6     1  2014    actual       10                  0               -20
    7     2  2014    actual       40                 20               -10
    8     3  2014    actual       50                 30               -40
    9     1  2015  forecast      120                 90                NA
    10    2  2015  forecast      210                160                NA
    11    3  2015  forecast      150                 60                NA
    12    1  2015    actual      130                120                30
    13    2  2015    actual      100                 60              -110
    14    3  2015    actual      190                140                40
    

EDIT - I'm getting pretty close with this:

    df['prev_year'] = df.groupby(['ids','run']).shift(1)['revenue']
    df['chg_from_prev_year'] = df['revenue'] - df['prev_year']
    
    df['curr_forecast'] = df.groupby(['ids','year']).shift(1)['revenue']
    df['chg_from_forecast'] = df['revenue'] - df['curr_forecast']
    

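The only thing missing (as expected) is the comparison between the 2014 forecast and the 2013 actual. I could duplicate the 2013 run in the dataset, calculate chg_from_prev_year for the 2014 forecast, and then hide or drop the unneeded rows from the final dataframe.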

1 Answer:

Answer 0 (score: 1)

First, to get the change from the previous year, do a shift within each group:

In [11]: g = df.groupby(['ids', 'run'])

In [12]: df['chg_from_prev_year'] = g['revenue'].apply(lambda x: x - x.shift())
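
As a side note, the per-group "value minus shifted value" step can probably be written with the built-in groupby `diff()`, which computes the same thing. This is only a minimal sketch: it rebuilds the question's frame so that it runs on its own, and it assumes the rows are already in year order within each (ids, run) group.

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    'ids': [1, 2, 3] * 5,
    'year': ['2013'] * 3 + ['2014'] * 6 + ['2015'] * 6,
    'run': ['actual'] * 3 + (['forecast'] * 3 + ['actual'] * 3) * 2,
    'revenue': [10, 20, 20, 30, 50, 90, 10, 40, 50, 120, 210, 150, 130, 100, 190],
})

# diff() within each (ids, run) group is revenue minus the previous row's
# revenue in that group; the first row of every group becomes NaN.
df['chg_from_prev_year'] = df.groupby(['ids', 'run'])['revenue'].diff()

print(df)
```

The result keeps the original row index, so it can be assigned straight back as a column.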

The next part is more complicated; I think you need a pivot_table for it:

In [13]: df1 = df.pivot_table('revenue', ['ids', 'year'], 'run')

In [14]: df1
Out[14]:
run       actual  forecast
ids year
1   2013      10       NaN
    2014      10        30
    2015     130       120
2   2013      20       NaN
    2014      40        50
    2015     100       210
3   2013      20       NaN
    2014      50        90
    2015     190       150

In [15]: g1 = df1.groupby(level='ids', as_index=False)

In [16]: out_by = g1.apply(lambda x: x['actual'] - x['forecast'])

In [17]: out_by  # hello levels bug, fixed in 0.13/master... yesterday :)
Out[17]:
ids  ids  year
1    1    2013    NaN
          2014    -20
          2015     10
2    2    2013    NaN
          2014    -10
          2015   -110
3    3    2013    NaN
          2014    -40
          2015     40
dtype: float64

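Which is the result you want, but not in the correct format (if you're not too fussy, see [31] below)... The following seems like a bit of a hack (to put it mildly), but here goes: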

In [21]: df2 = df.set_index(['ids', 'year', 'run'])

In [22]: out_by.index = out_by.index.droplevel(0)

In [23]: out_by_df = pd.DataFrame(out_by, columns=['revenue'])

In [24]: out_by_df['run'] = 'forecast'

In [25]: df2['chg_from_forecast'] = out_by_df.set_index('run', append=True)['revenue']

and we're done...

In [26]: df2.reset_index()
Out[26]:
    ids  year       run  revenue  chg_from_prev_year  chg_from_forecast
0     1  2013    actual       10                 NaN                NaN
1     2  2013    actual       20                 NaN                NaN
2     3  2013    actual       20                 NaN                NaN
3     1  2014  forecast       30                 NaN                -20
4     2  2014  forecast       50                 NaN                -10
5     3  2014  forecast       90                 NaN                -40
6     1  2014    actual       10                   0                NaN
7     2  2014    actual       40                  20                NaN
8     3  2014    actual       50                  30                NaN
9     1  2015  forecast      120                  90                 10
10    2  2015  forecast      210                 160               -110
11    3  2015  forecast      150                  60                 40
12    1  2015    actual      130                 120                NaN
13    2  2015    actual      100                  60                NaN
14    3  2015    actual      190                 140                NaN
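
The output above carries chg_from_forecast on the forecast rows, whereas the question's expected table carries it on the actual rows. One way to get the latter might be a self-merge on (ids, year). This is a sketch rather than a definitive recipe; `forecast_revenue` and `merged` are names made up for illustration.

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    'ids': [1, 2, 3] * 5,
    'year': ['2013'] * 3 + ['2014'] * 6 + ['2015'] * 6,
    'run': ['actual'] * 3 + (['forecast'] * 3 + ['actual'] * 3) * 2,
    'revenue': [10, 20, 20, 30, 50, 90, 10, 40, 50, 120, 210, 150, 130, 100, 190],
})

# Pull out the forecast rows and attach them to every row with the same (ids, year).
forecast = (df.loc[df['run'] == 'forecast', ['ids', 'year', 'revenue']]
              .rename(columns={'revenue': 'forecast_revenue'}))
merged = df.merge(forecast, on=['ids', 'year'], how='left')

# Keep the difference only on actual rows; forecast rows stay NaN.
is_actual = merged['run'] == 'actual'
merged['chg_from_forecast'] = (merged['revenue'] - merged['forecast_revenue']).where(is_actual)
merged = merged.drop(columns='forecast_revenue')

print(merged)
```

Because there is at most one forecast per (ids, year), the left merge keeps one output row per input row in the original order.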

Note: I think the first 6 results of chg_from_prev_year should be NaN.
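
If the 2014 forecast should instead be compared with the 2013 actual, as in the question's expected output, one possible sketch is to fall back to the previous year's actual whenever there is no earlier row of the same run. The names `prev_same_run`, `prev_actual`, and `baseline` are hypothetical, and the year arithmetic assumes every year string holds an integer.

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    'ids': [1, 2, 3] * 5,
    'year': ['2013'] * 3 + ['2014'] * 6 + ['2015'] * 6,
    'run': ['actual'] * 3 + (['forecast'] * 3 + ['actual'] * 3) * 2,
    'revenue': [10, 20, 20, 30, 50, 90, 10, 40, 50, 120, 210, 150, 130, 100, 190],
})

# Previous revenue of the same run (actual vs. actual, forecast vs. forecast).
prev_same_run = df.groupby(['ids', 'run'])['revenue'].shift(1)

# Each actual, relabelled with the following year, so that it lines up
# with the rows for which it is "last year's actual".
prev_actual = df.loc[df['run'] == 'actual', ['ids', 'year', 'revenue']].copy()
prev_actual['year'] = (prev_actual['year'].astype(int) + 1).astype(str)
prev_actual = prev_actual.rename(columns={'revenue': 'prev_actual'})
merged = df.merge(prev_actual, on=['ids', 'year'], how='left')

# Prefer the same-run baseline; use last year's actual only where it is missing.
baseline = prev_same_run.fillna(merged['prev_actual'])
df['chg_from_prev_year'] = df['revenue'] - baseline

print(df)
```

This avoids duplicating the 2013 rows in the data: only the 2014 forecast rows pick up the fallback, and the 2013 rows remain NaN.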

However, I think you may be better off keeping it as a pivot:

In [31]: df3 = df.pivot_table(['revenue', 'chg_from_prev_year'], ['ids', 'year'], 'run')

In [32]: df3['chg_from_forecast'] = g1.apply(lambda x: x['actual'] - x['forecast']).values

In [33]: df3
Out[33]:
          revenue            chg_from_prev_year            chg_from_forecast
run        actual  forecast              actual  forecast
ids year
1   2013       10       NaN                 NaN       NaN                NaN
    2014       10        30                   0       NaN                -20
    2015      130       120                 120        90                 10
2   2013       20       NaN                 NaN       NaN                NaN
    2014       40        50                  20       NaN                -10
    2015      100       210                  60       160               -110
3   2013       20       NaN                 NaN       NaN                NaN
    2014       50        90                  30       NaN                -40
    2015      190       150                 140        60                 40
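
Since the pivoted frame already has one row per (ids, year), the actual-minus-forecast step can probably also be done by subtracting the two columns directly, without a groupby. A minimal sketch follows; `wide` is a made-up name, and the keyword arguments assume a pandas version where pivot_table accepts `index=` and `columns=`.

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    'ids': [1, 2, 3] * 5,
    'year': ['2013'] * 3 + ['2014'] * 6 + ['2015'] * 6,
    'run': ['actual'] * 3 + (['forecast'] * 3 + ['actual'] * 3) * 2,
    'revenue': [10, 20, 20, 30, 50, 90, 10, 40, 50, 120, 210, 150, 130, 100, 190],
})

# One row per (ids, year), one column per run.
wide = df.pivot_table(values='revenue', index=['ids', 'year'], columns='run')

# Column arithmetic is already aligned on the (ids, year) index;
# years without a forecast (2013 here) come out as NaN.
wide['chg_from_forecast'] = wide['actual'] - wide['forecast']

print(wide.reset_index())
```

Calling reset_index() at the end turns ids and year back into ordinary columns if a flat table is easier to work with.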