Sum every x rows of a DataFrame and replace them

Time: 2018-10-18 07:47:58

Tags: python pandas dataframe

I have the following DataFrame:

                Date from             Date to  Actuals
4669  2017-12-22 06:00:00 2017-12-22 06:05:00       75
4670  2017-12-22 06:05:00 2017-12-22 06:10:00       81
4671  2017-12-22 06:10:00 2017-12-22 06:15:00       84
4672  2017-12-22 06:15:00 2017-12-22 06:20:00       78
4673  2017-12-22 06:20:00 2017-12-22 06:25:00       93
4674  2017-12-22 06:25:00 2017-12-22 06:30:00       93
4675  2017-12-22 06:30:00 2017-12-22 06:35:00       99
4676  2017-12-22 06:35:00 2017-12-22 06:40:00      102
4677  2017-12-22 06:40:00 2017-12-22 06:45:00      102
4678  2017-12-22 06:45:00 2017-12-22 06:50:00      108
4679  2017-12-22 06:50:00 2017-12-22 06:55:00      129
4680  2017-12-22 06:55:00 2017-12-22 07:00:00      135
4681  2017-12-22 07:00:00 2017-12-22 07:05:00      126
4682  2017-12-22 07:05:00 2017-12-22 07:10:00      111
4683  2017-12-22 07:10:00 2017-12-22 07:15:00       96
4684  2017-12-22 07:15:00 2017-12-22 07:20:00      111
4685  2017-12-22 07:20:00 2017-12-22 07:25:00      105
4686  2017-12-22 07:25:00 2017-12-22 07:30:00       99
4687  2017-12-22 07:30:00 2017-12-22 07:35:00      111
4688  2017-12-22 07:35:00 2017-12-22 07:40:00      129
4689  2017-12-22 07:40:00 2017-12-22 07:45:00      123
4690  2017-12-22 07:45:00 2017-12-22 07:50:00      138
4691  2017-12-22 07:50:00 2017-12-22 07:55:00      141
4692  2017-12-22 07:55:00 2017-12-22 08:00:00      156
4693  2017-12-22 08:00:00 2017-12-22 08:05:00      147
4694  2017-12-22 08:05:00 2017-12-22 08:10:00      120
4695  2017-12-22 08:10:00 2017-12-22 08:15:00       99
4696  2017-12-22 08:15:00 2017-12-22 08:20:00       75
4697  2017-12-22 08:20:00 2017-12-22 08:25:00       57
4698  2017-12-22 08:25:00 2017-12-22 08:30:00       45
                  ...                 ...      ...
53855 2018-10-08 03:30:00 2018-10-08 03:35:00        0
53856 2018-10-08 03:35:00 2018-10-08 03:40:00        0
53857 2018-10-08 03:40:00 2018-10-08 03:45:00        0
53858 2018-10-08 03:45:00 2018-10-08 03:50:00        0
53859 2018-10-08 03:50:00 2018-10-08 03:55:00        0
53860 2018-10-08 03:55:00 2018-10-08 04:00:00        0
53861 2018-10-08 04:00:00 2018-10-08 04:05:00        0
53862 2018-10-08 04:05:00 2018-10-08 04:10:00        0
53863 2018-10-08 04:10:00 2018-10-08 04:15:00        0
53864 2018-10-08 04:15:00 2018-10-08 04:20:00        0
53865 2018-10-08 04:20:00 2018-10-08 04:25:00        0
53866 2018-10-08 04:25:00 2018-10-08 04:30:00        0
53867 2018-10-08 04:30:00 2018-10-08 04:35:00        0
53868 2018-10-08 04:35:00 2018-10-08 04:40:00        0
53869 2018-10-08 04:40:00 2018-10-08 04:45:00        0
53870 2018-10-08 04:45:00 2018-10-08 04:50:00        0
53871 2018-10-08 04:50:00 2018-10-08 04:55:00        0
53872 2018-10-08 04:55:00 2018-10-08 05:00:00        0
53873 2018-10-08 05:00:00 2018-10-08 05:05:00        0
53874 2018-10-08 05:05:00 2018-10-08 05:10:00        0
53875 2018-10-08 05:10:00 2018-10-08 05:15:00        0
53876 2018-10-08 05:15:00 2018-10-08 05:20:00        0
53877 2018-10-08 05:20:00 2018-10-08 05:25:00        0
53878 2018-10-08 05:25:00 2018-10-08 05:30:00        0
53879 2018-10-08 05:30:00 2018-10-08 05:35:00        0
53880 2018-10-08 05:35:00 2018-10-08 05:40:00        0
53881 2018-10-08 05:40:00 2018-10-08 05:45:00        0
53882 2018-10-08 05:45:00 2018-10-08 05:50:00        0
53883 2018-10-08 05:50:00 2018-10-08 05:55:00        1
53884 2018-10-08 05:55:00 2018-10-08 06:00:00        0

[83324 rows x 3 columns]

I would like to add up the rows so that I get the cumulative value for each hour. Desired result:

             Date from             Date to  Actuals
1  2017-12-22 06:00:00 2017-12-22 07:00:00     1179
2  2017-12-22 07:00:00 2017-12-22 08:00:00     1157
                  ...                 ...      ...
1000 2018-10-08 05:00:00 2018-10-08 06:00:00      1

I tried DataFrame.sum(), but I could only get it to sum the entire column, not sub-sections selected by datetime. Any suggestions?

PS: In this case the DataFrame has one row every 5 minutes, but I imagine this should also be possible when that is not the case.

Edit: Using Statistic Dean's answer, I found out that this is not a perfectly filled DataFrame.

2 answers:

Answer 0: (score: 3)

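A simple approach (the output is not structured exactly the way you asked, but it is easy to reshape) is to use pandas.Grouper to group by hour and then sum Actuals, i.e.: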

import pandas
import random

#Creating the data frame
d = pandas.date_range('2017-12-22 06:00:00', periods = 50, freq = '5min')
d1 = pandas.date_range('2017-12-22 06:05:00', periods = 50, freq = '5min')
d2 = random.sample(range(1000), 50)
df = pandas.DataFrame({'Date_From':d, 
                       'Date_To':d1, 
                       'Actuals':d2})

(df
  .set_index('Date_From')
  .groupby(pandas.Grouper(freq = 'H'))['Actuals']
  .sum())

which gives:

Date_From
2017-12-22 06:00:00    5194
2017-12-22 07:00:00    5790
2017-12-22 08:00:00    5760
2017-12-22 09:00:00    6298
2017-12-22 10:00:00    1070
Freq: H, Name: Actuals, dtype: int64
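
To get from that Series to the three-column layout requested in the question, one possible sketch is shown below. It assumes the column names from the question ("Date from", "Date to", "Actuals"), uses the first 15 Actuals values from the question as illustrative data, and spells the hourly frequency as "60min" so the alias does not depend on the pandas version. The "Date to" column is simply derived as the hour start plus one hour, which is an assumption about what the asker wants.

```python
import pandas as pd

# Illustrative data: the first 15 Actuals values from the question
# (a full hour from 06:00, then a partial hour from 07:00).
actuals = [75, 81, 84, 78, 93, 93, 99, 102, 102, 108, 129, 135, 126, 111, 96]
start = pd.date_range("2017-12-22 06:00:00", periods=len(actuals), freq="5min")
df = pd.DataFrame({
    "Date from": start,
    "Date to": start + pd.Timedelta(minutes=5),
    "Actuals": actuals,
})

# Sum Actuals per hour, then turn the hour index back into a column.
hourly = (
    df.set_index("Date from")
      .groupby(pd.Grouper(freq="60min"))["Actuals"]
      .sum()
      .reset_index()
)

# Assumption: each output row covers exactly one hour.
hourly["Date to"] = hourly["Date from"] + pd.Timedelta(hours=1)
hourly = hourly[["Date from", "Date to", "Actuals"]]
print(hourly)
```

Note that the last row still reports a full-hour "Date to" even though only three 5-minute rows fall into it; whether partial hours should be kept or dropped depends on the use case.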

Answer 1: (score: 0)

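One thing you can notice is that you have to add up 12 terms at a time. So one solution is to walk through your DataFrame, summing 12 terms at a time, starting at the first term and stopping at the last one. You just need to be careful about the boundaries. Let's call your DataFrame df.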

import numpy as np
import pandas as pd

n = df.shape[0] // 12  # The number of complete hours (rows in the result)
cumulative = np.zeros(n)
date_from = []
date_to = []
# Now go through the dataframe 12 rows at a time
for i in range(n):
    cumulative[i] = df.iloc[12*i:12*(i+1), 2].sum()  # Sum of Actuals for the hour
    date_from.append(df.iloc[12*i, 0])  # Starting instant of the hour
    date_to.append(df.iloc[12*i+11, 1])  # Ending instant of the hour
# Now create your new dataframe (the keys must be strings)
new_df = pd.DataFrame({'Date_from': date_from, 'Date_to': date_to, 'Actuals': cumulative})

As I said before, this only works with correct boundaries (the first row is the start of an hour), and only up to the last complete hour.
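
Since the question's edit mentions that the DataFrame is not perfectly filled, fixed 12-row chunks would drift as soon as a row is missing. A minimal sketch of an alternative that does not depend on row counts is to group on the hour each "Date from" falls into. The data below is made up for illustration (rows for 06:10 and 06:15 are deliberately missing), and the column names are assumed to match the question.

```python
import pandas as pd

# Illustrative data with a gap: the 06:10 and 06:15 rows are missing.
starts = pd.to_datetime([
    "2017-12-22 06:00:00", "2017-12-22 06:05:00", "2017-12-22 06:20:00",
    "2017-12-22 07:00:00", "2017-12-22 07:05:00",
])
df = pd.DataFrame({
    "Date from": starts,
    "Date to": starts + pd.Timedelta(minutes=5),
    "Actuals": [75, 81, 93, 126, 111],
})

# Round each start time down to the beginning of its hour and group on that.
hour_start = df["Date from"].dt.floor("60min")
hourly = (
    df.groupby(hour_start)["Actuals"]
      .sum()
      .rename_axis("Date from")
      .reset_index()
)

# Assumption: each output row covers exactly one hour.
hourly.insert(1, "Date to", hourly["Date from"] + pd.Timedelta(hours=1))
print(hourly)
```

One design difference worth knowing: grouping on the floored column only yields hours that actually contain rows, whereas pandas.Grouper with a frequency also emits a row with sum 0 for every empty hour in between.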