Python Pandas: flatten a calendar with overlapping meetings to get the actual time spent in meetings

Time: 2016-07-30 14:06:21

Tags: python datetime pandas dataframe

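I am reading the details of my weekly calendar (subjects obviously changed to protect the innocent) into a pandas dataframe. One of my goals is to get the total time spent in meetings. I would like a dataframe indexed by a date_range with hourly frequency for the week, showing how many total minutes I was in meetings during each hour. My first challenge is that meetings overlap, and as much as I would like to be in two places at once, I certainly am not. I do hop out of one and into another, though. So, for example, the rows at index 8 and 9 should add up to 90 minutes of meeting time, not the 120 minutes I would get if I simply called df['Duration'].sum() on the column. How do I flatten the time periods in the dataframe so that overlaps are counted only once? It seems there should be an answer using date_range and periods, but I can't wrap my head around it. Below is my dataframe df.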

               Start                End Duration    Subject
0   07/04/16 10:30:00  07/04/16 11:00:00    30      Inspirational Poster Design Session
1   07/04/16 15:00:00  07/04/16 15:30:00    30      Corporate Speak Do's and Don'ts
2   07/04/16 09:00:00  07/04/16 12:00:00    180     Metrics or Matrix -Panel Discussion
3   07/04/16 13:30:00  07/04/16 15:00:00    90      "Do More with Less" kickoff party
4   07/05/16 09:00:00  07/05/16 10:00:00    60      Fiscal or Physical -Panel Discussion
5   07/05/16 14:00:00  07/05/16 14:30:00    30      "Why we can't have nice thing" training video
6   07/06/16 15:00:00  07/06/16 16:00:00    60      One-on-One with manager -Panel Discussion
7   07/06/16 09:00:00  07/06/16 10:00:00    60      Fireing for Performance leadership session
8   07/06/16 13:00:00  07/06/16 14:00:00    60      Birthday Cake in the conference room *MANDATORY*
9   07/06/16 12:30:00  07/06/16 13:30:00    60      Obligatory lunchtime meeting because it was the only time everyone had avaiable
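
To make the double counting concrete, here is a minimal sketch that uses only rows 8 and 9 from the table above. The variable names (`cake`, `lunch`) and the helper `minutes` are illustrative assumptions, and the subtraction trick only works for a single pair of intervals, so it is not a general solution.

```python
import pandas as pd

# Rows 8 and 9 from the table above, as (start, end) pairs
cake = (pd.Timestamp("2016-07-06 13:00"), pd.Timestamp("2016-07-06 14:00"))
lunch = (pd.Timestamp("2016-07-06 12:30"), pd.Timestamp("2016-07-06 13:30"))


def minutes(td):
    """Convert a Timedelta to whole minutes (hypothetical helper)."""
    return int(td.total_seconds() // 60)


# Naive total: what summing the Duration column would give for these two rows
naive_total = minutes(cake[1] - cake[0]) + minutes(lunch[1] - lunch[0])

# Overlap of the two intervals, clipped at zero in case they do not intersect
overlap = max(pd.Timedelta(0), min(cake[1], lunch[1]) - max(cake[0], lunch[0]))

# Time actually spent in meetings: count the shared half hour only once
actual_total = naive_total - minutes(overlap)

print(naive_total, actual_total)  # 120 90
```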

Any help is greatly appreciated.

Edit: Here is the output I would like to get from the dataset above.

2016-07-04 00:00:00 0
2016-07-04 01:00:00 0
2016-07-04 02:00:00 0
2016-07-04 03:00:00 0
2016-07-04 04:00:00 0
2016-07-04 05:00:00 0
2016-07-04 06:00:00 0
2016-07-04 07:00:00 0
2016-07-04 08:00:00 0
2016-07-04 09:00:00 60
2016-07-04 10:00:00 60
2016-07-04 11:00:00 60
2016-07-04 12:00:00 0
2016-07-04 13:00:00 30
2016-07-04 14:00:00 60
2016-07-04 15:00:00 30
2016-07-04 16:00:00 0
2016-07-04 17:00:00 0
2016-07-04 18:00:00 0
2016-07-04 19:00:00 0
2016-07-04 20:00:00 0
2016-07-04 21:00:00 0
2016-07-04 22:00:00 0
2016-07-04 23:00:00 0
2016-07-05 00:00:00 0
2016-07-05 01:00:00 0
2016-07-05 02:00:00 0
2016-07-05 03:00:00 0
2016-07-05 04:00:00 0
2016-07-05 05:00:00 0
2016-07-05 06:00:00 0
2016-07-05 07:00:00 0
2016-07-05 08:00:00 0
2016-07-05 09:00:00 60
2016-07-05 10:00:00 0
2016-07-05 11:00:00 0
2016-07-05 12:00:00 0
2016-07-05 13:00:00 0
2016-07-05 14:00:00 30
2016-07-05 15:00:00 0
2016-07-05 16:00:00 0
2016-07-05 17:00:00 0
2016-07-05 18:00:00 0
2016-07-05 19:00:00 0
2016-07-05 20:00:00 0
2016-07-05 21:00:00 0
2016-07-05 22:00:00 0
2016-07-05 23:00:00 0
2016-07-06 00:00:00 0
2016-07-06 01:00:00 0
2016-07-06 02:00:00 0
2016-07-06 03:00:00 0
2016-07-06 04:00:00 0
2016-07-06 05:00:00 0
2016-07-06 06:00:00 0
2016-07-06 07:00:00 0
2016-07-06 08:00:00 0
2016-07-06 09:00:00 60
2016-07-06 10:00:00 0
2016-07-06 11:00:00 0
2016-07-06 12:00:00 30
2016-07-06 13:00:00 60
2016-07-06 14:00:00 0
2016-07-06 15:00:00 60
2016-07-06 16:00:00 0
2016-07-06 17:00:00 0
2016-07-06 18:00:00 0
2016-07-06 19:00:00 0
2016-07-06 20:00:00 0
2016-07-06 21:00:00 0
2016-07-06 22:00:00 0
2016-07-06 23:00:00 0
2016-07-07 00:00:00 0

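1 Answer:

Answer 0 (score: 1)

One possibility is to create a time series indexed by minute (s below) that records whether you are in a meeting during that minute, and then resample it by hour. To match your desired output, you can adjust the start and end times of the index of s.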

import io
import pandas as pd

data = io.StringIO('''\
Start,End,Duration,Subject
0,07/04/16 10:30:00,07/04/16 11:00:00,30,Inspirational Poster Design Session
1,07/04/16 15:00:00,07/04/16 15:30:00,30,Corporate Speak Do's and Don'ts
2,07/04/16 09:00:00,07/04/16 12:00:00,180,Metrics or Matrix -Panel Discussion
3,07/04/16 13:30:00,07/04/16 15:00:00,90,"Do More with Less" kickoff party
4,07/05/16 09:00:00,07/05/16 10:00:00,60,Fiscal or Physical -Panel Discussion
5,07/05/16 14:00:00,07/05/16 14:30:00,30,"Why we can't have nice thing" training video
6,07/06/16 15:00:00,07/06/16 16:00:00,60,One-on-One with manager -Panel Discussion
7,07/06/16 09:00:00,07/06/16 10:00:00,60,Fireing for Performance leadership session
8,07/06/16 13:00:00,07/06/16 14:00:00,60,Birthday Cake in the conference room *MANDATORY*
9,07/06/16 12:30:00,07/06/16 13:30:00,60,Obligatory lunchtime meeting because it was the only time everyone
''')
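# Only the timestamps and the subject are needed; Duration is not used below
df = pd.read_csv(data, usecols=['Start', 'End', 'Subject'])
# The timestamps are month/day/two-digit year, so state the format explicitly
df['Start'] = pd.to_datetime(df['Start'], format='%m/%d/%y %H:%M:%S')
df['End'] = pd.to_datetime(df['End'], format='%m/%d/%y %H:%M:%S')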

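# Label slices on a DatetimeIndex include the right endpoint, so subtract
# one minute to keep each meeting's end time exclusive
tdel = pd.Timedelta('1min')

# One flag per minute: True if any meeting covers that minute
s = pd.Series(False, index=pd.date_range(start=df['Start'].min(),
                                         end=df['End'].max()-tdel,
                                         freq='min'))
for _, meeting in df.iterrows():
    s.loc[meeting['Start'] : meeting['End']-tdel] = True

# Summing the flags per hour gives the minutes spent in meetings
# (recent pandas versions prefer the alias 'h' over 'H')
result = s.resample('1H').sum().astype(int)
print(result)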

Output:

2016-07-04 09:00:00    60
2016-07-04 10:00:00    60
2016-07-04 11:00:00    60
2016-07-04 12:00:00     0
2016-07-04 13:00:00    30
2016-07-04 14:00:00    60
2016-07-04 15:00:00    30
2016-07-04 16:00:00     0
2016-07-04 17:00:00     0
2016-07-04 18:00:00     0
2016-07-04 19:00:00     0
2016-07-04 20:00:00     0
2016-07-04 21:00:00     0
2016-07-04 22:00:00     0
2016-07-04 23:00:00     0
2016-07-05 00:00:00     0
2016-07-05 01:00:00     0
2016-07-05 02:00:00     0
2016-07-05 03:00:00     0
2016-07-05 04:00:00     0
2016-07-05 05:00:00     0
2016-07-05 06:00:00     0
2016-07-05 07:00:00     0
2016-07-05 08:00:00     0
2016-07-05 09:00:00    60
2016-07-05 10:00:00     0
2016-07-05 11:00:00     0
2016-07-05 12:00:00     0
2016-07-05 13:00:00     0
2016-07-05 14:00:00    30
2016-07-05 15:00:00     0
2016-07-05 16:00:00     0
2016-07-05 17:00:00     0
2016-07-05 18:00:00     0
2016-07-05 19:00:00     0
2016-07-05 20:00:00     0
2016-07-05 21:00:00     0
2016-07-05 22:00:00     0
2016-07-05 23:00:00     0
2016-07-06 00:00:00     0
2016-07-06 01:00:00     0
2016-07-06 02:00:00     0
2016-07-06 03:00:00     0
2016-07-06 04:00:00     0
2016-07-06 05:00:00     0
2016-07-06 06:00:00     0
2016-07-06 07:00:00     0
2016-07-06 08:00:00     0
2016-07-06 09:00:00    60
2016-07-06 10:00:00     0
2016-07-06 11:00:00     0
2016-07-06 12:00:00    30
2016-07-06 13:00:00    60
2016-07-06 14:00:00     0
2016-07-06 15:00:00    60
Freq: H, dtype: int64
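
The output above starts at the first meeting and stops at the last one, while the question asks for a series that runs from midnight to midnight. The following is a minimal sketch of the index adjustment mentioned in the answer, not the answerer's own code. The window bounds (`window_start`, `window_end`) are an assumption about what the question wants, and the meeting times are retyped from the question's table.

```python
import pandas as pd

# (start, end) pairs retyped from the question's table
meetings = [
    ("2016-07-04 10:30", "2016-07-04 11:00"),
    ("2016-07-04 15:00", "2016-07-04 15:30"),
    ("2016-07-04 09:00", "2016-07-04 12:00"),
    ("2016-07-04 13:30", "2016-07-04 15:00"),
    ("2016-07-05 09:00", "2016-07-05 10:00"),
    ("2016-07-05 14:00", "2016-07-05 14:30"),
    ("2016-07-06 15:00", "2016-07-06 16:00"),
    ("2016-07-06 09:00", "2016-07-06 10:00"),
    ("2016-07-06 13:00", "2016-07-06 14:00"),
    ("2016-07-06 12:30", "2016-07-06 13:30"),
]
df = pd.DataFrame(meetings, columns=["Start", "End"]).apply(pd.to_datetime)

one_min = pd.Timedelta(minutes=1)

# Assumed reporting window: midnight before the first meeting through
# midnight after the last one, as in the question's desired output
window_start = df["Start"].min().normalize()
window_end = df["End"].max().normalize() + pd.Timedelta(days=1)

# One flag per minute over the whole window; True means "in a meeting"
busy = pd.Series(False, index=pd.date_range(window_start, window_end, freq="min"))
for start, end in zip(df["Start"], df["End"]):
    # Label slicing is inclusive, so stop one minute before the end time
    busy.loc[start : end - one_min] = True

# Minutes in meetings per hour, including the empty hours at both ends
hourly = busy.resample("h").sum().astype(int)

# Sanity check: flattened total versus the naive sum of durations
naive_total = int((df["End"] - df["Start"]).sum() / one_min)
print(hourly.sum(), naive_total)  # 600 660
```

Because the minute index already covers the full window, no separate padding step is needed after resampling. The 60-minute gap between the two totals is the overlap on 07/04 (30 minutes) plus the overlap on 07/06 (30 minutes).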