Calculating the sum of a column grouped by hour

Time: 2019-04-03 01:22:16

Tags: python pandas dataframe

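I am trying to calculate the total cost of the staff required over a day. My attempt is to group the People required throughout the day and multiply by the cost. I then try to group this cost per hour. But my output isn't correct.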

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as dates

d = ({
    'Time' : ['0/1/1900 8:00:00','0/1/1900 9:59:00','0/1/1900 10:00:00','0/1/1900 12:29:00','0/1/1900 12:30:00','0/1/1900 13:00:00','0/1/1900 13:02:00','0/1/1900 13:15:00','0/1/1900 13:20:00','0/1/1900 18:10:00','0/1/1900 18:15:00','0/1/1900 18:20:00','0/1/1900 18:25:00','0/1/1900 18:45:00','0/1/1900 18:50:00','0/1/1900 19:05:00','0/1/1900 19:07:00','0/1/1900 21:57:00','0/1/1900 22:00:00','0/1/1900 22:30:00','0/1/1900 22:35:00','1/1/1900 3:00:00','1/1/1900 3:05:00','1/1/1900 3:20:00','1/1/1900 3:25:00'],                 
    'People' : [1,1,2,2,3,3,2,2,3,3,4,4,3,3,2,2,3,3,4,4,3,3,2,2,1],                      
     })

df = pd.DataFrame(data = d)

df['Time'] = ['/'.join([str(int(x.split('/')[0])+1)] + x.split('/')[1:]) for x in df['Time']]
df['Time'] = pd.to_datetime(df['Time'], format='%d/%m/%Y %H:%M:%S')
formatter = dates.DateFormatter('%Y-%m-%d %H:%M:%S') 

df = df.groupby(pd.Grouper(freq='15T',key='Time'))['People'].max().ffill()
df = df.reset_index(level=['Time'])

df['Cost'] = df['People'] * 26

cost = df.groupby([df['Time'].dt.hour])['Cost'].sum()

#For reference. This plot displays people required throughout the day
fig, ax = plt.subplots(figsize = (10,5))
plt.plot(df['Time'], df['People'], color = 'blue')

plt.locator_params(axis='y', nbins=6)
ax.xaxis.set_major_formatter(formatter)
ax.xaxis.set_major_formatter(dates.DateFormatter('%H:%M:%S'))
plt.ylabel('People Required', labelpad = 10)
plt.xlabel('Time', labelpad = 10)

print(cost)

Out:

0     416.0
1     416.0
2     416.0
3     130.0
8     104.0
9     104.0
10    208.0
11    208.0
12    260.0
13    312.0
14    312.0
15    312.0
16    312.0
17    312.0
18    364.0
19    312.0
20    312.0
21    312.0
22    416.0
23    416.0

I did the calculation manually, and the total cost should be:

$1456

1 answer:

Answer 0 (score: 1)

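I think the incorrect numbers in your question are most likely caused by the incorrect datetime values. Once that is fixed, you should get the correct numbers. Here is my final attempt, with some adjustments to the Time column.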

import pandas as pd

df = pd.DataFrame({
    'Time' : ['1/1/1900 8:00:00','1/1/1900 9:59:00','1/1/1900 10:00:00','1/1/1900 12:29:00','1/1/1900 12:30:00','1/1/1900 13:00:00','1/1/1900 13:02:00','1/1/1900 13:15:00','1/1/1900 13:20:00','1/1/1900 18:10:00','1/1/1900 18:15:00','1/1/1900 18:20:00','1/1/1900 18:25:00','1/1/1900 18:45:00','1/1/1900 18:50:00','1/1/1900 19:05:00','1/1/1900 19:07:00','1/1/1900 21:57:00','1/1/1900 22:00:00','1/1/1900 22:30:00','1/1/1900 22:35:00','1/2/1900 3:00:00','1/2/1900 3:05:00','1/2/1900 3:20:00','1/2/1900 3:25:00'],
    'People' : [1,1,2,2,3,3,2,2,3,3,4,4,3,3,2,2,3,3,4,4,3,3,2,2,1],                      
     })

>>>df
                 Time  People
0    1/1/1900 8:00:00       1
1    1/1/1900 9:59:00       1
2   1/1/1900 10:00:00       2
3   1/1/1900 12:29:00       2
4   1/1/1900 12:30:00       3
5   1/1/1900 13:00:00       3
6   1/1/1900 13:02:00       2
7   1/1/1900 13:15:00       2
8   1/1/1900 13:20:00       3
9   1/1/1900 18:10:00       3
10  1/1/1900 18:15:00       4
11  1/1/1900 18:20:00       4
12  1/1/1900 18:25:00       3
13  1/1/1900 18:45:00       3
14  1/1/1900 18:50:00       2
15  1/1/1900 19:05:00       2
16  1/1/1900 19:07:00       3
17  1/1/1900 21:57:00       3
18  1/1/1900 22:00:00       4
19  1/1/1900 22:30:00       4
20  1/1/1900 22:35:00       3
21   1/2/1900 3:00:00       3
22   1/2/1900 3:05:00       2
23   1/2/1900 3:20:00       2
24   1/2/1900 3:25:00       1

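# Parse the strings into datetimes (month-first by default)
df.Time = pd.to_datetime(df.Time)
# set_index is a DataFrame method, so call it on df rather than on df.Time
df.set_index('Time', inplace=True)
# Max headcount per 15-minute bin; empty bins carry the previous value forward
# ('T' is the minute alias; newer pandas versions prefer '15min')
df_group = df.resample('15T').max().ffill()
# Max headcount per hour, then cost at 26 per person per hour
df_hour = df_group.resample('1h').max()
df_hour['Cost'] = df_hour['People'] * 26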

>>>df_hour
                     People   Cost
Time
1900-01-01 08:00:00     1.0   26.0
1900-01-01 09:00:00     1.0   26.0
1900-01-01 10:00:00     2.0   52.0
1900-01-01 11:00:00     2.0   52.0
1900-01-01 12:00:00     3.0   78.0
1900-01-01 13:00:00     3.0   78.0
1900-01-01 14:00:00     3.0   78.0
1900-01-01 15:00:00     3.0   78.0
1900-01-01 16:00:00     3.0   78.0
1900-01-01 17:00:00     3.0   78.0
1900-01-01 18:00:00     4.0  104.0
1900-01-01 19:00:00     3.0   78.0
1900-01-01 20:00:00     3.0   78.0
1900-01-01 21:00:00     3.0   78.0
1900-01-01 22:00:00     4.0  104.0
1900-01-01 23:00:00     4.0  104.0
1900-01-02 00:00:00     4.0  104.0
1900-01-02 01:00:00     4.0  104.0
1900-01-02 02:00:00     4.0  104.0
1900-01-02 03:00:00     3.0   78.0

>>>df_hour.sum()
People      60.0
Cost      1560.0
dtype: float64
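
A note on the Time strings: the question parses them day-first with format='%d/%m/%Y %H:%M:%S', while pd.to_datetime without a format reads strings like '1/2/1900' month-first. Passing the format explicitly removes the ambiguity. The snippet below is only a small sketch of that difference; the variable names are made up for illustration.

```python
import pandas as pd

raw = '1/2/1900 3:00:00'

# Month-first reading: January 2nd, 1900 (matches the output shown above)
month_first = pd.to_datetime(raw, format='%m/%d/%Y %H:%M:%S')

# Day-first reading: February 1st, 1900 (the format used in the question)
day_first = pd.to_datetime(raw, format='%d/%m/%Y %H:%M:%S')

print(month_first)
print(day_first)
```

Whichever convention the data really follows, stating it in format= keeps the late-night rows on the intended day.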

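Edit: It took me a second read to understand the approach you are using. The incorrect numbers come from calling sum() per hour after ffill() had already been applied to the 15-minute People groups. Since ffill() fills each gap with the last valid value, every hour ends up holding four 15-minute rows, and summing them overestimates the cost for those periods. You should use max() again instead, to find the maximum headcount required in that hour.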
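
To make the difference between sum() and max() after ffill() concrete, here is a minimal sketch on a small made-up sample (three rows, not the data from the question). The rate of 26 comes from the question; the names RATE, quarter, hourly_sum and hourly_max are mine.

```python
import pandas as pd

RATE = 26  # cost per person per hour, as in the question

# Hypothetical sample data, for illustration only
df = pd.DataFrame({
    'Time': pd.to_datetime(['1900-01-01 08:00:00',
                            '1900-01-01 09:30:00',
                            '1900-01-01 10:45:00']),
    'People': [1, 2, 3],
}).set_index('Time')

# 15-minute bins; empty bins are filled with the last known headcount
quarter = df['People'].resample('15min').max().ffill()

# Summing the four 15-minute rows of each hour counts the headcount four times
hourly_sum = quarter.resample('1h').sum() * RATE

# Taking the max keeps one headcount per hour
hourly_max = quarter.resample('1h').max() * RATE

print(hourly_sum)
print(hourly_max)
```

With this sample, the sum-based cost for the 08:00 hour is 104 (4 rows x 1 person x 26), while the max-based cost is 26. Whether max is the right billing rule (as opposed to, say, a time-weighted average) depends on how staff are actually paid, which the question does not state.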