Question

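I want to apply some deviations to a data frame at monthly granularity and then put the result back into the initial data frame. I first do a groupby and an aggregation; that part works well. Then I reindex and get only NaN. I would like the reindex to work by matching the month of each groupby element against the initial data frame. I would also like to be able to do this at other granularities (year -> month and year, ...).

Has anyone solved this problem?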

>>> df['profile']
date
2015-01-01 00:00:00    3.000000
2015-01-01 01:00:00    3.000143
2015-01-01 02:00:00    3.000287
2015-01-01 03:00:00    3.000430
2015-01-01 04:00:00    3.000574
...
2015-12-31 20:00:00    2.999426
2015-12-31 21:00:00    2.999570
2015-12-31 22:00:00    2.999713
2015-12-31 23:00:00    2.999857
Freq: H, Name: profile, Length: 8760

### Deviation on monthly basis
>>> dev_monthly = np.random.uniform(0.5, 1.5, len(df['profile'].groupby(df.index.month).aggregate(np.sum)))


>>> df['profile_monthly'] = (df['profile'].groupby(df.index.month).aggregate(np.sum) * dev_monthly).reindex(df)

>>> df['profile_monthly']
date
2015-01-01 00:00:00   NaN
2015-01-01 01:00:00   NaN
2015-01-01 02:00:00   NaN
...
2015-12-31 22:00:00   NaN
2015-12-31 23:00:00   NaN
Freq: H, Name: profile_monthly, Length: 8760

Answer 1

See the documentation for resampling.

You are looking for resample, followed by fillna with method='bfill':

In [105]: df = DataFrame({'profile': normal(3, 0.1, size=10000)}, pd.date_range(start='2015-01-01', freq='H', periods=10000))

In [106]: df['profile_monthly'] = df.profile.resample('M', how='sum')

In [107]: df
Out[107]:
                     profile  profile_monthly
2015-01-01 00:00:00   2.8328              NaN
2015-01-01 01:00:00   3.0607              NaN
2015-01-01 02:00:00   3.0138              NaN
2015-01-01 03:00:00   3.0402              NaN
2015-01-01 04:00:00   3.0335              NaN
2015-01-01 05:00:00   3.0087              NaN
2015-01-01 06:00:00   3.0557              NaN
2015-01-01 07:00:00   2.9280              NaN
2015-01-01 08:00:00   3.1359              NaN
2015-01-01 09:00:00   2.9681              NaN
2015-01-01 10:00:00   3.1240              NaN
2015-01-01 11:00:00   3.0635              NaN
2015-01-01 12:00:00   2.9206              NaN
2015-01-01 13:00:00   3.0714              NaN
2015-01-01 14:00:00   3.0688              NaN
2015-01-01 15:00:00   3.0703              NaN
2015-01-01 16:00:00   2.9102              NaN
2015-01-01 17:00:00   2.9368              NaN
2015-01-01 18:00:00   3.0864              NaN
2015-01-01 19:00:00   3.2124              NaN
2015-01-01 20:00:00   2.8988              NaN
2015-01-01 21:00:00   3.0659              NaN
2015-01-01 22:00:00   2.7973              NaN
2015-01-01 23:00:00   3.0824              NaN
2015-01-02 00:00:00   3.0199              NaN
                         ...              ...

[10000 rows x 2 columns]

In [108]: df.dropna()
Out[108]:
            profile  profile_monthly
2015-01-31   2.9769        2230.9931
2015-02-28   2.9930        2016.1045
2015-03-31   2.7817        2232.4096
2015-04-30   3.1695        2158.7834
2015-05-31   2.9040        2236.5962
2015-06-30   2.8697        2162.7784
2015-07-31   2.9278        2231.7232
2015-08-31   2.8289        2236.4603
2015-09-30   3.0368        2163.5916
2015-10-31   3.1517        2233.2285
2015-11-30   3.0450        2158.6998
2015-12-31   2.8261        2228.5550
2016-01-31   3.0264        2229.2221

[13 rows x 2 columns]

In [110]: df.fillna(method='bfill')
Out[110]:
                     profile  profile_monthly
2015-01-01 00:00:00   2.8328        2230.9931
2015-01-01 01:00:00   3.0607        2230.9931
2015-01-01 02:00:00   3.0138        2230.9931
2015-01-01 03:00:00   3.0402        2230.9931
2015-01-01 04:00:00   3.0335        2230.9931
2015-01-01 05:00:00   3.0087        2230.9931
2015-01-01 06:00:00   3.0557        2230.9931
2015-01-01 07:00:00   2.9280        2230.9931
2015-01-01 08:00:00   3.1359        2230.9931
2015-01-01 09:00:00   2.9681        2230.9931
2015-01-01 10:00:00   3.1240        2230.9931
2015-01-01 11:00:00   3.0635        2230.9931
2015-01-01 12:00:00   2.9206        2230.9931
2015-01-01 13:00:00   3.0714        2230.9931
2015-01-01 14:00:00   3.0688        2230.9931
2015-01-01 15:00:00   3.0703        2230.9931
2015-01-01 16:00:00   2.9102        2230.9931
2015-01-01 17:00:00   2.9368        2230.9931
2015-01-01 18:00:00   3.0864        2230.9931
2015-01-01 19:00:00   3.2124        2230.9931
2015-01-01 20:00:00   2.8988        2230.9931
2015-01-01 21:00:00   3.0659        2230.9931
2015-01-01 22:00:00   2.7973        2230.9931
2015-01-01 23:00:00   3.0824        2230.9931
2015-01-02 00:00:00   3.0199        2230.9931
                         ...              ...

[10000 rows x 2 columns]
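
The `how=` argument of resample used above has been removed from recent pandas releases. As an alternative, here is a minimal sketch that broadcasts the monthly sum back to the hourly rows with `groupby(...).transform`, so no reindex or fill step is needed. The constant `profile` values and the name `month_key` are assumptions made for illustration; they are not part of the original session.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one non-leap year of hourly values, as in the question.
index = pd.date_range("2015-01-01", periods=8760, freq="h")
df = pd.DataFrame({"profile": np.full(len(index), 3.0)}, index=index)

# One key per row identifying its month. The year is part of the key,
# so January 2015 and January 2016 would never be merged.
month_key = df.index.to_period("M")

# transform returns a result aligned with the original hourly index,
# so every row receives the sum of its own month.
df["profile_monthly"] = df.groupby(month_key)["profile"].transform("sum")
```

Because the result already has the hourly index, every hour of a month carries that month's value, including the last day of the month.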

Answer 2

When I use your code, I do not get the same value at 2015-12-31 00:00:00 and 2015-12-31 01:00:00, as shown below:

>>> df.fillna(method='bfill')[np.logical_and(df.index.month==12, df.index.day==31)]
                    profile  profile_monthly
2015-12-31 00:00:00  2.926504      2232.288997
2015-12-31 01:00:00  3.008543      2234.470731
2015-12-31 02:00:00  2.930133      2234.470731
2015-12-31 03:00:00  3.078552      2234.470731
2015-12-31 04:00:00  3.141578      2234.470731
2015-12-31 05:00:00  3.061820      2234.470731
2015-12-31 06:00:00  2.981626      2234.470731
2015-12-31 07:00:00  3.010749      2234.470731
2015-12-31 08:00:00  2.878577      2234.470731
2015-12-31 09:00:00  2.915487      2234.470731
2015-12-31 10:00:00  3.072721      2234.470731
2015-12-31 11:00:00  3.087866      2234.470731
2015-12-31 12:00:00  3.089208      2234.470731
2015-12-31 13:00:00  2.957047      2234.470731
2015-12-31 14:00:00  3.002072      2234.470731
2015-12-31 15:00:00  3.106656      2234.470731
2015-12-31 16:00:00  3.100891      2234.470731
2015-12-31 17:00:00  3.077835      2234.470731
2015-12-31 18:00:00  3.032497      2234.470731
2015-12-31 19:00:00  2.959838      2234.470731
2015-12-31 20:00:00  2.878819      2234.470731
2015-12-31 21:00:00  3.041171      2234.470731
2015-12-31 22:00:00  3.061970      2234.470731
2015-12-31 23:00:00  3.019011      2234.470731

[24 rows x 2 columns]
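
A likely explanation is that the monthly resample labels each bin with the month-end date at midnight (the df.dropna() output in Answer 1 shows those labels), so a backfill only reaches rows up to that timestamp and the remaining hours of the last day pick up the next month's value. The sketch below reproduces the effect without resample by building the month-end labels by hand; the constant series and the names `monthly` and `aligned` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical hourly data covering the last two days of December and all of January.
index = pd.date_range("2015-12-30", "2016-01-31 23:00", freq="h")
profile = pd.Series(1.0, index=index)

# Monthly sums, relabelled at month-end midnight to mimic the labels
# shown in the df.dropna() output of Answer 1.
monthly = profile.groupby(index.to_period("M")).sum()
monthly.index = monthly.index.to_timestamp(how="end").normalize()

# Place the monthly values on the hourly index and backfill the gaps.
aligned = monthly.reindex(index).bfill()

# The December value stops at 2015-12-31 00:00:00; later hours of that
# day are filled from the January label instead.
dec_31_midnight = aligned[pd.Timestamp("2015-12-31 00:00")]
dec_31_one_am = aligned[pd.Timestamp("2015-12-31 01:00")]
```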

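So I finally arrived at the following solution:

>>> # one group per (year, month); a list of keys is required by recent pandas
>>> AA  = df.groupby([df.index.year, df.index.month]).aggregate(np.mean)
>>> # one normally distributed deviation per group (randn(0, 1, n) would return an empty array)
>>> AA['dev'] = np.random.normal(0, 1, len(AA))
>>> # look up each row's (year, month) group; .ix has been removed, so use .loc
>>> df['dev'] = AA.loc[list(zip(df.index.year, df.index.month)), 'dev'].values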

Short and fast. The only remaining question is:

=> How to handle other granularities (half-year, quarter, week, ...)?
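
One possible way to handle other granularities, sketched under assumptions rather than offered as a definitive answer: compute one key per row for the desired granularity, then scale each group's sum by a random factor and broadcast it back to the rows. The helper name `apply_group_deviation`, the `granularity_keys` dictionary, and the half-year key formula are hypothetical choices made for this illustration.

```python
import numpy as np
import pandas as pd

def apply_group_deviation(series, keys, rng):
    """Return, for every row, its group's sum scaled by one random factor per group."""
    # codes[i] is the integer id of the group that row i belongs to.
    codes, uniques = pd.factorize(keys)
    factors = rng.uniform(0.5, 1.5, len(uniques))
    # transform keeps the original index; factors[codes] picks each row's factor.
    return series.groupby(codes).transform("sum") * factors[codes]

# Hypothetical data: one year of hourly values.
index = pd.date_range("2015-01-01", periods=8760, freq="h")
df = pd.DataFrame({"profile": np.full(len(index), 3.0)}, index=index)

# One key per row for each granularity (all include the year).
granularity_keys = {
    "year": index.year,
    "half_year": index.year * 2 + (index.month - 1) // 6,
    "quarter": index.to_period("Q"),
    "month": index.to_period("M"),
    "week": index.to_period("W"),
}

rng = np.random.default_rng(0)
for name, keys in granularity_keys.items():
    df["profile_" + name] = apply_group_deviation(df["profile"], keys, rng)
```

Adding a granularity then only requires a new entry in the dictionary, as long as the key takes the same value for all rows that belong to the same group.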

