How do I use resample().sum() on a DataFrame with irregular timestamps while interpolating across overlaps?

Time: 2016-05-31 16:07:44

Tags: python pandas dataframe interpolation resampling

I have a dataset like the one below. Each row has a start time, an end time, and the corresponding values for that block.

Block_start         Block_end           Total  Coal Waste
01/20/2016 5:00     01/20/2016 5:23     1284    0   1284
01/20/2016 5:23     01/20/2016 6:44     5755    0   5755
01/20/2016 6:44     01/20/2016 8:21     8058    0   8058
01/20/2016 8:21     01/20/2016 10:04    8584    0   8584
01/20/2016 10:04    01/20/2016 11:49    8790    0   8790
01/20/2016 11:49    01/20/2016 12:58    3437    0   3437
01/20/2016 12:58    01/20/2016 16:52    19532   0   19532
01/20/2016 16:52    01/20/2016 21:15    21925   0   21925
01/20/2016 21:15    01/21/2016 1:47     22636   0   22636
01/21/2016 1:47     01/21/2016 5:07     16701   0   16701
01/21/2016 5:07     01/21/2016 11:55    10205   0   10205
01/21/2016 11:55    01/21/2016 17:07    25965   0   25965
01/21/2016 17:07    01/21/2016 22:09    25188   0   25188
01/21/2016 22:09    01/22/2016 3:41     27666   0   27666
01/22/2016 3:41     01/22/2016 8:01     21698   0   21698
01/22/2016 8:01     01/22/2016 15:34    11315   0   11315
01/22/2016 15:34    01/22/2016 19:55    21778   0   21778
01/22/2016 19:55    01/23/2016 0:25     22481   0   22481
...

I want to sum these values at an 8-hour frequency, with label='left' and bins that start at 5 AM. I set the index to 'Block_end' and tried resampling. This is what I tried:

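df = df.set_index('Block_end')  # set_index returns a new DataFrame, so assign it back
# base=5 shifts the bin origin to 05:00 (pandas < 2.0; later versions use offset='5h' instead)
df_resamped = df.resample('8H', closed='left', label='left', base=5).sum()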

But the result (below) is not what I want.

Block_end   Total   Coal    Waste
2016-01-20 13:00:00 35908   0   35908
2016-01-20 21:00:00 19532   0   19532
2016-01-21 05:00:00 44561   0   44561
2016-01-21 13:00:00 26906   0   26906
2016-01-21 21:00:00 25965   0   25965
2016-01-22 05:00:00 52854   0   52854
2016-01-22 13:00:00 21698   0   21698
2016-01-22 21:00:00 33093   0   33093
2016-01-23 05:00:00 44774   0   44774
...

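For a block that straddles a bin boundary, such as the one ending at 01/20/2016 21:15, I want the last 15 minutes' share to go to the following bin and the rest to the preceding bin, but pandas does not do that. It is a kind of interpolation.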

1 Answer:

Answer 0: (score: 0)

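I'm not sure what the desired result is, but I believe that if you only want to interpolate between the values at the desired range, there is no need to sum.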

Starting with this DataFrame (make sure Block_end is your DatetimeIndex):

df
Out[175]: 
                          Block_start  Total  Coal  Waste
Block_end                                                
2016-01-20 05:23:00   01/20/2016 5:00   1284     0   1284
2016-01-20 06:44:00   01/20/2016 5:23   5755     0   5755
2016-01-20 08:21:00   01/20/2016 6:44   8058     0   8058
2016-01-20 10:04:00   01/20/2016 8:21   8584     0   8584
2016-01-20 11:49:00  01/20/2016 10:04   8790     0   8790
2016-01-20 12:58:00  01/20/2016 11:49   3437     0   3437
2016-01-20 16:52:00  01/20/2016 12:58  19532     0  19532
2016-01-20 21:15:00  01/20/2016 16:52  21925     0  21925
2016-01-21 01:47:00  01/20/2016 21:15  22636     0  22636
2016-01-21 05:07:00   01/21/2016 1:47  16701     0  16701
2016-01-21 11:55:00   01/21/2016 5:07  10205     0  10205
2016-01-21 17:07:00  01/21/2016 11:55  25965     0  25965
2016-01-21 22:09:00  01/21/2016 17:07  25188     0  25188
2016-01-22 03:41:00  01/21/2016 22:09  27666     0  27666
2016-01-22 08:01:00   01/22/2016 3:41  21698     0  21698
2016-01-22 15:34:00   01/22/2016 8:01  11315     0  11315
2016-01-22 19:55:00  01/22/2016 15:34  21778     0  21778
2016-01-23 00:25:00  01/22/2016 19:55  22481     0  22481
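
If Block_end is still a column of strings, one way to get a DatetimeIndex is sketched below. This is only an illustration on the first three rows of the sample data; the name `raw` and the explicit month/day/year format string are assumptions based on how the timestamps appear in the question.

```python
import pandas as pd

# First three rows of the sample data from the question, as raw strings.
raw = pd.DataFrame({
    "Block_start": ["01/20/2016 5:00", "01/20/2016 5:23", "01/20/2016 6:44"],
    "Block_end": ["01/20/2016 5:23", "01/20/2016 6:44", "01/20/2016 8:21"],
    "Total": [1284, 5755, 8058],
    "Coal": [0, 0, 0],
    "Waste": [1284, 5755, 8058],
})

# Parse the end times explicitly as month/day/year (assumed format),
# then use them as the index so that resample() can work on it.
raw["Block_end"] = pd.to_datetime(raw["Block_end"], format="%m/%d/%Y %H:%M")
df = raw.set_index("Block_end")

print(type(df.index).__name__)
```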

First, define the desired result range:

rng = pd.date_range(start=pd.Timestamp('2016-01-20 13:00'), end=pd.Timestamp('2016-01-22 21:00'), freq='8 h')

Then resample your DataFrame to one-minute frequency, fill the gaps with DataFrame.interpolate(), and reindex to the desired range:

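# Interpolate only the numeric columns; Block_start holds strings and cannot be interpolated
df_resamped = df[['Total', 'Coal', 'Waste']].resample('min').interpolate().reindex(rng)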

df_resamped
Out[178]: 
                            Total  Coal         Waste
2016-01-20 13:00:00   3574.564103     0   3574.564103
2016-01-20 21:00:00  21788.517110     0  21788.517110
2016-01-21 05:00:00  16908.725000     0  16908.725000
2016-01-21 13:00:00  13488.333333     0  13488.333333
2016-01-21 21:00:00  25365.526490     0  25365.526490
2016-01-22 05:00:00  25852.646154     0  25852.646154
2016-01-22 13:00:00  14844.761589     0  14844.761589
2016-01-22 21:00:00  21947.240741     0  21947.240741
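
Note that the interpolation above returns point values of Total at each bin edge rather than per-bin sums. If the goal is the proportional split described in the question, one possible sketch is to interpolate the running (cumulative) total at the bin edges and then take differences. This rests on two assumptions that the question does not state: each block's total accrues at a uniform rate between Block_start and Block_end, and the blocks are contiguous (as they are in the sample). The names `blocks`, `knots`, `edges`, and `binned` are illustrative only.

```python
import numpy as np
import pandas as pd

# First ten blocks from the question: (Block_start, Block_end, Total).
rows = [
    ("01/20/2016 5:00", "01/20/2016 5:23", 1284),
    ("01/20/2016 5:23", "01/20/2016 6:44", 5755),
    ("01/20/2016 6:44", "01/20/2016 8:21", 8058),
    ("01/20/2016 8:21", "01/20/2016 10:04", 8584),
    ("01/20/2016 10:04", "01/20/2016 11:49", 8790),
    ("01/20/2016 11:49", "01/20/2016 12:58", 3437),
    ("01/20/2016 12:58", "01/20/2016 16:52", 19532),
    ("01/20/2016 16:52", "01/20/2016 21:15", 21925),
    ("01/20/2016 21:15", "01/21/2016 1:47", 22636),
    ("01/21/2016 1:47", "01/21/2016 5:07", 16701),
]
blocks = pd.DataFrame(rows, columns=["Block_start", "Block_end", "Total"])
for col in ("Block_start", "Block_end"):
    blocks[col] = pd.to_datetime(blocks[col], format="%m/%d/%Y %H:%M")

# Running total at every block boundary: 0 at the first start,
# then the cumulative sum at each block end (assumes contiguous blocks).
knots = pd.DatetimeIndex([blocks["Block_start"].iloc[0], *blocks["Block_end"]])
cumulative = np.concatenate([[0.0], blocks["Total"].cumsum().to_numpy(dtype=float)])

# 8-hour bin edges starting at 05:00.
edges = pd.date_range("2016-01-20 05:00", "2016-01-21 05:00", freq="8h")

def to_seconds(index):
    """Seconds elapsed since the first block start, as a float array."""
    return np.asarray((index - knots[0]).total_seconds(), dtype=float)

# Linear interpolation of the running total at each bin edge (uniform-rate
# assumption), then differences between consecutive edges give per-bin sums.
cum_at_edges = np.interp(to_seconds(edges), to_seconds(knots), cumulative)
binned = pd.Series(np.diff(cum_at_edges), index=edges[:-1], name="Total")

print(binned)
```

With this approach the block from 16:52 to 21:15 contributes 248 of its 263 minutes to the bin labelled 13:00 and the remaining 15 minutes to the bin labelled 21:00. The same steps could be repeated for the Coal and Waste columns.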