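Average values over groups of columns

Time: 2013-09-11 16:11:45

Tags: python pandas

I have a DataFrame of monthly values, and I want to get quarterly values as the mean of each group of 3 months. My data looks like this (the example shows only the first 9 months):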

month        01     02     03     04     05     06     07     08     09  \
year                                                                  
2000       90.26  90.95  91.04  90.87  90.78  91.13  90.87  90.95  91.30   
2000       87.89  89.68  90.10  90.27  90.53  90.87  89.93  91.30  91.98   
2000       74.17  74.98  74.74  73.97  74.07  74.26  74.71  76.93  78.67   
2000        NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
2000       86.74  85.48  87.45  88.31  88.71  88.23  88.08  87.76  88.94 

I want to get Q1 as the mean of months 01, 02 and 03. I can do:

 df['Q1']=(df['01']+df['02']+df['03'])/3

but I will have problems with NaN.

Is there a way to compute the mean over groups of three months?

3 answers:

Answer 0: (score: 2)

You can do this manually using loc and mean:

In [11]: df.loc[:, ['01', '02', '03']]
Out[11]: 
         01     02     03
year                    
2000  90.26  90.95  91.04
2000  87.89  89.68  90.10
2000  74.17  74.98  74.74
2000    NaN    NaN    NaN
2000  86.74  85.48  87.45

In [12]: df.loc[:, ['01', '02', '03']].mean(axis=1)
Out[12]: 
year
2000    90.750000
2000    89.223333
2000    74.630000
2000          NaN
2000    86.556667
dtype: float64
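
On the NaN concern from the question: mean skips NaN by default (skipna=True), whereas adding the columns with + propagates it. The sketch below illustrates the difference on a small hypothetical frame in which the second row is missing only month 02; that partially missing row is an assumption made for illustration and is not part of the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 is complete, row 1 is missing month 02,
# row 2 is missing everything (like the all-NaN row in the question).
df = pd.DataFrame(
    {
        "01": [90.26, 87.89, np.nan],
        "02": [90.95, np.nan, np.nan],
        "03": [91.04, 90.10, np.nan],
    },
    index=pd.Index([2000, 2000, 2000], name="year"),
)

months = ["01", "02", "03"]

# Manual sum: a single NaN makes the whole row NaN.
manual = (df["01"] + df["02"] + df["03"]) / 3

# mean(axis=1) averages the values that are present in each row.
q1 = df[months].mean(axis=1)

# skipna=False reproduces the strict behaviour of the manual sum.
q1_strict = df[months].mean(axis=1, skipna=False)

print(pd.DataFrame({"manual": manual, "mean": q1, "strict": q1_strict}))
```

A row that is entirely NaN stays NaN in all three variants; only partially missing rows differ.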

However, it probably makes more sense to use pandas' rolling_mean:

In [21]: pd.rolling_mean(df.T, 3)
Out[21]: 
year        2000       2000       2000  2000       2000
month                                                  
01           NaN        NaN        NaN   NaN        NaN
02           NaN        NaN        NaN   NaN        NaN
03     90.750000  89.223333  74.630000   NaN  86.556667
04     90.953333  90.016667  74.563333   NaN  87.080000
05     90.896667  90.300000  74.260000   NaN  88.156667
06     90.926667  90.556667  74.100000   NaN  88.416667
07     90.926667  90.443333  74.346667   NaN  88.340000
08     90.983333  90.700000  75.300000   NaN  88.023333
09     91.040000  91.070000  76.770000   NaN  88.260000

By default this looks back over 3 periods, so we have to shift the result up by two:

In [22]: pd.rolling_mean(df.T, 3).shift(-2)
Out[22]: 
year        2000       2000       2000  2000       2000
month                                                  
01     90.750000  89.223333  74.630000   NaN  86.556667
02     90.953333  90.016667  74.563333   NaN  87.080000
03     90.896667  90.300000  74.260000   NaN  88.156667
04     90.926667  90.556667  74.100000   NaN  88.416667
05     90.926667  90.443333  74.346667   NaN  88.340000
06     90.983333  90.700000  75.300000   NaN  88.023333
07     91.040000  91.070000  76.770000   NaN  88.260000
08           NaN        NaN        NaN   NaN        NaN
09           NaN        NaN        NaN   NaN        NaN

and transpose back into the original layout:

In [23]: pd.rolling_mean(df.T, 3).shift(-2).T
Out[23]: 
month         01         02         03         04         05         06      07   08   09
year                                                                      
2000   90.750000  90.953333  90.896667  90.926667  90.926667  90.983333   91.04  NaN  NaN 
2000   89.223333  90.016667  90.300000  90.556667  90.443333  90.700000   91.07  NaN  NaN 
2000   74.630000  74.563333  74.260000  74.100000  74.346667  75.300000   76.77  NaN  NaN 
2000         NaN        NaN        NaN        NaN        NaN        NaN     NaN  NaN  NaN  
2000   86.556667  87.080000  88.156667  88.416667  88.340000  88.023333   88.26  NaN  NaN 
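
Note that pd.rolling_mean was later removed from pandas; the same computation can be written with the .rolling() method. The following is a minimal sketch, assuming a recent pandas version. To keep it short it uses only two of the rows above, and it gives them unique row labels (2000_0 and 2000_3, the naming used in the next answer), which is an assumption of this example. The quarterly means are the windows that start at months 01, 04 and 07.

```python
import numpy as np
import pandas as pd

months = ["%02d" % m for m in range(1, 10)]  # '01' .. '09'

# Two rows from the question; row labels are made unique for this sketch.
df = pd.DataFrame(
    [
        [90.26, 90.95, 91.04, 90.87, 90.78, 91.13, 90.87, 90.95, 91.30],
        [np.nan] * 9,
    ],
    index=["2000_0", "2000_3"],
    columns=pd.Index(months, name="month"),
)

# Rolling 3-month mean along the months, shifted so that each value is
# stored at the first month of its window, then transposed back.
rolled = df.T.rolling(3).mean().shift(-2).T

# Keep only the windows that coincide with calendar quarters.
quarters = rolled[["01", "04", "07"]].rename(
    columns={"01": "Q1", "04": "Q2", "07": "Q3"}
)
print(quarters)
```

The rolling result contains every 3-month window, so selecting the columns 01, 04 and 07 is what turns it into quarterly values.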

Answer 1: (score: 1)

Use resample.

In [89]: x
Out[89]: 
           1      2      3      4      5      6      7      8      9
month                                                               
2000   90.26  90.95  91.04  90.87  90.78  91.13  90.87  90.95  91.30
2000   87.89  89.68  90.10  90.27  90.53  90.87  89.93  91.30  91.98
2000   74.17  74.98  74.74  73.97  74.07  74.26  74.71  76.93  78.67
2000     NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN
2000   86.74  85.48  87.45  88.31  88.71  88.23  88.08  87.76  88.94

In [90]: x.columns = pd.PeriodIndex([pd.Period(year=2000, month=m, freq='M')
                                         for m in x.columns])

In [92]: x.index = ['%s_%s' % (y,i) for i, y in enumerate(x.index)]

In [93]: x
Out[93]: 
        2000-01  2000-02  2000-03  2000-04  2000-05  2000-06  2000-07  2000-08  2000-09
2000_0    90.26    90.95    91.04    90.87    90.78    91.13    90.87    90.95    91.30
2000_1    87.89    89.68    90.10    90.27    90.53    90.87    89.93    91.30    91.98
2000_2    74.17    74.98    74.74    73.97    74.07    74.26    74.71    76.93    78.67
2000_3      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN
2000_4    86.74    85.48    87.45    88.31    88.71    88.23    88.08    87.76    88.94

In [94]: x.resample('Q', axis=1)
Out[94]: 
           2000Q1     2000Q2  2000Q3
2000_0  90.750000  90.926667   91.04
2000_1  89.223333  90.556667   91.07
2000_2  74.630000  74.100000   76.77
2000_3        NaN        NaN     NaN
2000_4  86.556667  88.416667   88.26

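There is a bug when resampling with a duplicate index, which is why I renamed the index here; it is fixed in 0.13 (but this solution uses 0.12).

This ends up being the most flexible approach, because you can now resample at different frequencies.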

In [95]: x.resample('Q-JAN', axis=1)
Out[95]: 
        2000Q4     2001Q1     2001Q2  2001Q3
2000_0   90.26  90.953333  90.926667  91.125
2000_1   87.89  90.016667  90.443333  91.640
2000_2   74.17  74.563333  74.346667  77.800
2000_3     NaN        NaN        NaN     NaN
2000_4   86.74  87.080000  88.340000  88.350
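
In more recent pandas versions, resample no longer aggregates implicitly, and an explicit aggregation such as .mean() is expected. The sketch below swaps in a different technique that avoids resample altogether: it converts the monthly PeriodIndex to quarterly periods with asfreq and groups the transposed frame on them. quarterly_mean is a hypothetical helper name, and only two rows of x are reproduced.

```python
import numpy as np
import pandas as pd

# Two rows of x, with monthly periods as column labels.
x = pd.DataFrame(
    [
        [90.26, 90.95, 91.04, 90.87, 90.78, 91.13, 90.87, 90.95, 91.30],
        [np.nan] * 9,
    ],
    index=["2000_0", "2000_3"],
    columns=pd.period_range("2000-01", periods=9, freq="M"),
)


def quarterly_mean(frame, freq="Q"):
    """Average monthly columns into quarters (hypothetical helper).

    Each monthly period is mapped to the quarter that contains it, and
    the transposed frame is grouped on those quarterly periods.
    """
    quarters = frame.columns.asfreq(freq)
    return frame.T.groupby(quarters).mean().T


q = quarterly_mean(x)               # calendar quarters
q_jan = quarterly_mean(x, "Q-JAN")  # fiscal year ending in January

print(q)
print(q_jan)
```

With "Q-JAN", January 2000 falls alone into 2000Q4, which matches the first column of the output above.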

Answer 2: (score: 0)

import pandas as pd
import io

# StringIO wraps the text so it can be read like a file (BytesIO rejects str on Python 3)
content = io.StringIO('''\
year        01     02     03     04     05     06     07     08     09  
2000       90.26  90.95  91.04  90.87  90.78  91.13  90.87  90.95  91.30   
2000       87.89  89.68  90.10  90.27  90.53  90.87  89.93  91.30  91.98   
2000       74.17  74.98  74.74  73.97  74.07  74.26  74.71  76.93  78.67   
2000        NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
2000       86.74  85.48  87.45  88.31  88.71  88.23  88.08  87.76  88.94''')

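# Raw string, so that '\s' is not interpreted as a string escape
df = pd.read_table(content, sep=r'\s+', index_col=0)
df.columns.name='month'
# Map month labels '01'..'09' to quarter numbers 0..2 and average each group;
# transposing first groups the month labels without relying on groupby(axis=1)
df2 = df.T.groupby(lambda x: (int(x)-1)//3).mean().T
df2.columns='Q1 Q2 Q3'.split()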
print(df2)

             Q1         Q2     Q3
year                             
2000  90.750000  90.926667  91.04
2000  89.223333  90.556667  91.07
2000  74.630000  74.100000  76.77
2000        NaN        NaN    NaN
2000  86.556667  88.416667  88.26

You can join these columns to the original DataFrame with:

df = pd.concat([df2, df], axis=1)
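
The grouping and concat steps can also be wrapped in one function that names the quarters directly, so the result does not need a hard-coded list of column names. This is only a sketch: add_quarter_means is a hypothetical helper, the frame holds two of the rows above, and the row labels are made unique for the example.

```python
import numpy as np
import pandas as pd

months = ["%02d" % m for m in range(1, 10)]  # '01' .. '09'

# Two rows from the question; row labels are made unique for this sketch.
df = pd.DataFrame(
    [
        [90.26, 90.95, 91.04, 90.87, 90.78, 91.13, 90.87, 90.95, 91.30],
        [np.nan] * 9,
    ],
    index=["2000_0", "2000_3"],
    columns=pd.Index(months, name="month"),
)


def add_quarter_means(frame):
    """Prepend Q1..Q4 mean columns to a frame of monthly columns."""

    def quarter_label(month):
        # '01'..'03' -> 'Q1', '04'..'06' -> 'Q2', and so on
        return "Q%d" % ((int(month) - 1) // 3 + 1)

    # Group the month labels (rows of the transposed frame) by quarter.
    quarters = frame.T.groupby(quarter_label).mean().T
    return pd.concat([quarters, frame], axis=1)


out = add_quarter_means(df)
print(out)
```

With all 12 months present, the same function would also produce a Q4 column.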