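I have a DataFrame containing monthly values, and I would like to obtain quarterly values as the mean of each group of 3 months. My data looks like this (the example shows only the first 9 months):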
month 01 02 03 04 05 06 07 08 09 \
year
2000 90.26 90.95 91.04 90.87 90.78 91.13 90.87 90.95 91.30
2000 87.89 89.68 90.10 90.27 90.53 90.87 89.93 91.30 91.98
2000 74.17 74.98 74.74 73.97 74.07 74.26 74.71 76.93 78.67
2000 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2000 86.74 85.48 87.45 88.31 88.71 88.23 88.08 87.76 88.94
I would like to get Q1 as the mean of months 01, 02 and 03. I can do:
df['Q1']=(df['01']+df['02']+df['03'])/3
But then I will have a problem with NaN.
Is there a way to compute the mean over groups of three months?
Answer 0 (score: 2)
You can do this manually using loc and mean:
In [11]: df.loc[:, ['01', '02', '03']]
Out[11]:
01 02 03
year
2000 90.26 90.95 91.04
2000 87.89 89.68 90.10
2000 74.17 74.98 74.74
2000 NaN NaN NaN
2000 86.74 85.48 87.45
In [12]: df.loc[:, ['01', '02', '03']].mean(axis=1)
Out[12]:
year
2000 90.750000
2000 89.223333
2000 74.630000
2000 NaN
2000 86.556667
dtype: float64
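To make the NaN concern from the question concrete, here is a minimal sketch comparing plain arithmetic with mean. The two-row frame is hypothetical (the second row is missing month 02 on purpose), and the variable names by_sum, by_mean and strict are only illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: the second row is missing month 02.
df = pd.DataFrame(
    {"01": [90.26, 87.89], "02": [90.95, np.nan], "03": [91.04, 90.10]},
    index=pd.Index([2000, 2000], name="year"),
)
cols = ["01", "02", "03"]

# Plain arithmetic propagates NaN: one missing month makes the result NaN.
by_sum = (df["01"] + df["02"] + df["03"]) / 3

# mean(axis=1) skips NaN by default (skipna=True) and averages what is left.
by_mean = df[cols].mean(axis=1)

# Pass skipna=False to require all three months to be present.
strict = df[cols].mean(axis=1, skipna=False)

print(pd.DataFrame({"by_sum": by_sum, "by_mean": by_mean, "strict": strict}))
```

Which behaviour is right depends on whether a quarter with a missing month should still get a value.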
But it may make more sense to use pandas' rolling_mean:
In [21]: pd.rolling_mean(df.T, 3)
Out[21]:
year 2000 2000 2000 2000 2000
month
01 NaN NaN NaN NaN NaN
02 NaN NaN NaN NaN NaN
03 90.750000 89.223333 74.630000 NaN 86.556667
04 90.953333 90.016667 74.563333 NaN 87.080000
05 90.896667 90.300000 74.260000 NaN 88.156667
06 90.926667 90.556667 74.100000 NaN 88.416667
07 90.926667 90.443333 74.346667 NaN 88.340000
08 90.983333 90.700000 75.300000 NaN 88.023333
09 91.040000 91.070000 76.770000 NaN 88.260000
By default each value is the mean of the current period and the two before it, so we have to shift the result up by two:
In [22]: pd.rolling_mean(df.T, 3).shift(-2)
Out[22]:
year 2000 2000 2000 2000 2000
month
01 90.750000 89.223333 74.630000 NaN 86.556667
02 90.953333 90.016667 74.563333 NaN 87.080000
03 90.896667 90.300000 74.260000 NaN 88.156667
04 90.926667 90.556667 74.100000 NaN 88.416667
05 90.926667 90.443333 74.346667 NaN 88.340000
06 90.983333 90.700000 75.300000 NaN 88.023333
07 91.040000 91.070000 76.770000 NaN 88.260000
08 NaN NaN NaN NaN NaN
09 NaN NaN NaN NaN NaN
And transpose back to the original layout:
In [23]: pd.rolling_mean(df.T, 3).shift(-2).T
Out[23]:
month 01 02 03 04 05 06 07 08 09
year
2000 90.750000 90.953333 90.896667 90.926667 90.926667 90.983333 91.04 NaN NaN
2000 89.223333 90.016667 90.300000 90.556667 90.443333 90.700000 91.07 NaN NaN
2000 74.630000 74.563333 74.260000 74.100000 74.346667 75.300000 76.77 NaN NaN
2000 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2000 86.556667 87.080000 88.156667 88.416667 88.340000 88.023333 88.26 NaN NaN
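Note that pd.rolling_mean was removed in later pandas releases in favour of the .rolling() method. A rough equivalent of the steps above, as a sketch: the row labels 2000_0 and 2000_3 are assumptions chosen to keep the labels unique, and only two of the rows are reproduced.

```python
import numpy as np
import pandas as pd

months = [f"{m:02d}" for m in range(1, 10)]
df = pd.DataFrame(
    [[90.26, 90.95, 91.04, 90.87, 90.78, 91.13, 90.87, 90.95, 91.30],
     [np.nan] * 9],
    index=["2000_0", "2000_3"],  # assumed unique row labels
    columns=pd.Index(months, name="month"),
)

# Transpose so months run down the rows, take a 3-period rolling mean,
# shift up by two so each value sits on the first month of its window,
# then transpose back.
rolled = df.T.rolling(window=3).mean().shift(-2).T

# Every third column is then a calendar-quarter mean (Q1, Q2, Q3).
quarters = rolled[["01", "04", "07"]]
print(quarters)
```

The rolling result contains every overlapping 3-month window, so selecting the columns 01, 04 and 07 is what narrows it down to quarters.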
Answer 1 (score: 1)
Use resample.
In [89]: x
Out[89]:
1 2 3 4 5 6 7 8 9
month
2000 90.26 90.95 91.04 90.87 90.78 91.13 90.87 90.95 91.30
2000 87.89 89.68 90.10 90.27 90.53 90.87 89.93 91.30 91.98
2000 74.17 74.98 74.74 73.97 74.07 74.26 74.71 76.93 78.67
2000 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2000 86.74 85.48 87.45 88.31 88.71 88.23 88.08 87.76 88.94
In [90]: x.columns = pd.PeriodIndex([pd.Period(year=2000, month=m, freq='M')
for m in x.columns])
In [92]: x.index = ['%s_%s' % (y,i) for i, y in enumerate(x.index)]
In [93]: x
Out[93]:
2000-01 2000-02 2000-03 2000-04 2000-05 2000-06 2000-07 2000-08 2000-09
2000_0 90.26 90.95 91.04 90.87 90.78 91.13 90.87 90.95 91.30
2000_1 87.89 89.68 90.10 90.27 90.53 90.87 89.93 91.30 91.98
2000_2 74.17 74.98 74.74 73.97 74.07 74.26 74.71 76.93 78.67
2000_3 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2000_4 86.74 85.48 87.45 88.31 88.71 88.23 88.08 87.76 88.94
In [94]: x.resample('Q', axis=1)
Out[94]:
2000Q1 2000Q2 2000Q3
2000_0 90.750000 90.926667 91.04
2000_1 89.223333 90.556667 91.07
2000_2 74.630000 74.100000 76.77
2000_3 NaN NaN NaN
2000_4 86.556667 88.416667 88.26
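There is a bug when resampling with a duplicate index, which is why I renamed the index here; it is fixed in 0.13 (but this solution uses 0.12).
This is ultimately the most flexible approach, because you can now resample at different frequencies.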
In [95]: x.resample('Q-JAN', axis=1)
Out[95]:
2000Q4 2001Q1 2001Q2 2001Q3
2000_0 90.26 90.953333 90.926667 91.125
2000_1 87.89 90.016667 90.443333 91.640
2000_2 74.17 74.563333 74.346667 77.800
2000_3 NaN NaN NaN NaN
2000_4 86.74 87.080000 88.340000 88.350
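In recent pandas versions resample returns a Resampler object and needs an explicit aggregation such as .mean(). As an alternative that avoids resample altogether, the sketch below converts the monthly PeriodIndex to a quarterly frequency with asfreq and groups on the resulting labels. The helper name quarterly_mean is hypothetical, and only two of the rows are reproduced.

```python
import numpy as np
import pandas as pd

cols = pd.period_range("2000-01", periods=9, freq="M")
x = pd.DataFrame(
    [[90.26, 90.95, 91.04, 90.87, 90.78, 91.13, 90.87, 90.95, 91.30],
     [np.nan] * 9],
    index=["2000_0", "2000_3"],
    columns=cols,
)

def quarterly_mean(frame, freq="Q"):
    """Average monthly columns per quarter (hypothetical helper)."""
    # Map each monthly period to its quarter label, e.g. '2000Q1'.
    labels = np.asarray(frame.columns.asfreq(freq).astype(str))
    # Group the transposed frame by those labels, then transpose back.
    return frame.T.groupby(labels).mean().T

calendar_q = quarterly_mean(x)          # calendar quarters
fiscal_q = quarterly_mean(x, "Q-JAN")   # fiscal year ending in January
print(calendar_q)
print(fiscal_q)
```

Changing the freq argument is what gives the same flexibility as resampling at a different frequency.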
Answer 2 (score: 0)
import pandas as pd
import io
content = io.StringIO('''\
year 01 02 03 04 05 06 07 08 09
2000 90.26 90.95 91.04 90.87 90.78 91.13 90.87 90.95 91.30
2000 87.89 89.68 90.10 90.27 90.53 90.87 89.93 91.30 91.98
2000 74.17 74.98 74.74 73.97 74.07 74.26 74.71 76.93 78.67
2000 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2000 86.74 85.48 87.45 88.31 88.71 88.23 88.08 87.76 88.94''')
df = pd.read_table(content, sep=r'\s+', index_col=0)
df.columns.name='month'
df2 = df.groupby(by=lambda x: (int(x)-1)//3, axis=1).mean()
df2.columns='Q1 Q2 Q3'.split()
print(df2)
Q1 Q2 Q3
year
2000 90.750000 90.926667 91.04
2000 89.223333 90.556667 91.07
2000 74.630000 74.100000 76.77
2000 NaN NaN NaN
2000 86.556667 88.416667 88.26
You can concatenate these columns onto the original DataFrame using:
df = pd.concat([df2, df], axis=1)
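Newer pandas releases deprecate the axis=1 argument of groupby, so here is a sketch of the same grouping done on the transposed frame, followed by the concat step. The row labels 2000_0 and 2000_3 are assumptions chosen to keep the labels unique, and only two of the rows are reproduced.

```python
import numpy as np
import pandas as pd

months = [f"{m:02d}" for m in range(1, 10)]
df = pd.DataFrame(
    [[90.26, 90.95, 91.04, 90.87, 90.78, 91.13, 90.87, 90.95, 91.30],
     [np.nan] * 9],
    index=["2000_0", "2000_3"],  # assumed unique row labels
    columns=pd.Index(months, name="month"),
)

# Group the transposed frame by quarter number (0, 1, 2), then transpose back.
df2 = df.T.groupby(lambda m: (int(m) - 1) // 3).mean().T
df2.columns = [f"Q{q + 1}" for q in df2.columns]

# Put the quarterly columns in front of the monthly ones.
combined = pd.concat([df2, df], axis=1)
print(combined)
```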