This is the code for the sample dataset I have:
import pandas as pd

data = {
    'ID': [4, 4, 4, 4, 22, 22, 23, 25, 29],
    'Zone': [32, 34, 21, 34, 27, 29, 32, 75, 9],
    'checkin_datetime': ['04-01-2019 13:07', '04-01-2019 13:09', '04-01-2019 14:06',
                         '04-01-2019 14:55', '04-01-2019 20:23', '04-01-2019 21:38',
                         '04-01-2019 21:38', '04-01-2019 23:22', '04-02-2019 01:00'],
    'checkout_datetime': ['04-01-2019 13:09', '04-01-2019 13:12', '04-01-2019 14:07',
                          '04-01-2019 15:06', '04-01-2019 21:32', '04-01-2019 21:42',
                          '04-01-2019 21:45', '04-02-2019 00:23', '04-02-2019 06:15'],
}
df = pd.DataFrame(data, columns=['ID', 'Zone', 'checkin_datetime', 'checkout_datetime'])
# The strings are month-first (MM-DD-YYYY); stating the format avoids ambiguous parsing.
df['checkout_datetime'] = pd.to_datetime(df['checkout_datetime'], format='%m-%d-%Y %H:%M')
df['checkin_datetime'] = pd.to_datetime(df['checkin_datetime'], format='%m-%d-%Y %H:%M')
Using this dataset, I am trying to create the following dataset:
Checked_in_hour     ID   Zone   checked_in_minutes
01-04-2019 13:00     4     32                    2
01-04-2019 13:00     4     34                    3
01-04-2019 14:00     4     21                    1
01-04-2019 14:00     4     34                    5
01-04-2019 15:00     4     34                    6
01-04-2019 20:00    22     27                   37
01-04-2019 20:00    22     27                    8
01-04-2019 20:00    22     27                   37
01-04-2019 21:00    22     29                    4
01-04-2019 21:00    23     32                    7
01-04-2019 23:00    25     75                   38
02-04-2019 00:00    25     75                   24
02-04-2019 01:00    29      9                   60
02-04-2019 02:00    29      9                   60
02-04-2019 03:00    29      9                   60
02-04-2019 04:00    29      9                   60
02-04-2019 05:00    29      9                   60
02-04-2019 06:00    29      9                   16
checked_in_minutes is calculated as the difference between checkout_datetime and checkin_datetime, and that time is then grouped by hour and Zone.
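To make the split concrete, here is a minimal sketch for a single stay that crosses an hour boundary (14:55 to 15:06 from the sample data). The helper name split_stay_by_hour is hypothetical and introduced only for illustration; it assumes minutes are counted as the plain difference between the two timestamps, clipped to each clock hour.

```python
import pandas as pd

def split_stay_by_hour(checkin, checkout):
    """Return (hour_start, minutes) pairs for one stay (hypothetical helper)."""
    pieces = []
    hour_start = checkin.floor('h')
    while hour_start < checkout:
        hour_end = hour_start + pd.Timedelta(hours=1)
        # Overlap of the stay with the current clock hour.
        overlap = min(checkout, hour_end) - max(checkin, hour_start)
        pieces.append((hour_start, int(overlap.total_seconds() // 60)))
        hour_start = hour_end
    return pieces

stay = split_stay_by_hour(pd.Timestamp('2019-04-01 14:55'),
                          pd.Timestamp('2019-04-01 15:06'))
for hour, minutes in stay:
    print(hour, minutes)
# 2019-04-01 14:00:00 5
# 2019-04-01 15:00:00 6
```

The 11-minute stay contributes 5 minutes to the 14:00 bucket and 6 minutes to the 15:00 bucket.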
So far, this is my code that does the calculation at the Checked_in_hour level; the Zone variable still needs to be added:
# working logic
# One row per minute between the first check-in and the last check-out.
df2 = pd.DataFrame(
    index=pd.date_range(
        start=df['checkin_datetime'].min(),
        end=df['checkout_datetime'].max(), freq='min'),
    columns=['is_checked_in', 'ID'], data=0)
for index, row in df.iterrows():
    # Label-based slices include both endpoints, so the checkout minute is marked too.
    stay = slice(row['checkin_datetime'], row['checkout_datetime'])
    df2.loc[stay, 'is_checked_in'] = 1
    df2.loc[stay, 'ID'] = row['ID']
# Minutes checked in per hour, with the largest ID seen in that hour.
df3 = df2.resample('h').aggregate({'is_checked_in': 'sum', 'ID': 'max'})
Answer 0: (score: 0)
I'm not sure this is efficient, but it should work.
import pandas as pd
from datetime import timedelta

def group_into_hourly_buckets(df):
    grouped_data = []
    for idx, row in df.iterrows():
        # Whole minutes of the stay; total_seconds() also covers stays longer than a day.
        duration = row['checkout_datetime'] - row['checkin_datetime']
        dur = int(duration.total_seconds() // 60)
        start_time = row['checkin_datetime']
        hours_ = 0
        while dur > 0:
            _data = {}
            _data['Checked_in_hour'] = start_time.floor('h') + timedelta(hours=hours_)
            time_spent_in_window = min(dur, 60)
            if hours_ == 0:
                # Minutes left in the first hour; 60 when the stay starts exactly on the hour.
                time_spent_in_window = min(time_spent_in_window, 60 - start_time.minute)
            _data['checked_in_minutes'] = time_spent_in_window
            _data['ID'] = row['ID']
            _data['Zone'] = row['Zone']
            dur -= time_spent_in_window
            hours_ += 1
            grouped_data.append(_data)
    return pd.DataFrame(grouped_data)
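As an alternative to looping row by row, the same bucketing can be sketched with date_range, explode and groupby. This is only a sketch under stated assumptions: the function name hourly_minutes is hypothetical, the date strings are assumed to be month-first, and minutes are counted as plain differences (checkout minus checkin), so some values may differ by one from a minute-by-minute inclusive count.

```python
import pandas as pd

data = {
    'ID': [4, 4, 4, 4, 22, 22, 23, 25, 29],
    'Zone': [32, 34, 21, 34, 27, 29, 32, 75, 9],
    'checkin_datetime': ['04-01-2019 13:07', '04-01-2019 13:09', '04-01-2019 14:06',
                         '04-01-2019 14:55', '04-01-2019 20:23', '04-01-2019 21:38',
                         '04-01-2019 21:38', '04-01-2019 23:22', '04-02-2019 01:00'],
    'checkout_datetime': ['04-01-2019 13:09', '04-01-2019 13:12', '04-01-2019 14:07',
                          '04-01-2019 15:06', '04-01-2019 21:32', '04-01-2019 21:42',
                          '04-01-2019 21:45', '04-02-2019 00:23', '04-02-2019 06:15'],
}
df = pd.DataFrame(data)
for col in ['checkin_datetime', 'checkout_datetime']:
    # Assumption: the strings are month-first (MM-DD-YYYY).
    df[col] = pd.to_datetime(df[col], format='%m-%d-%Y %H:%M')

def hourly_minutes(df):
    """Sum checked-in minutes per hour, ID and Zone (hypothetical helper)."""
    out = df[['ID', 'Zone', 'checkin_datetime', 'checkout_datetime']].copy()
    # For each stay, list the start of every clock hour it touches.
    out['Checked_in_hour'] = [
        list(pd.date_range(start.floor('h'), end, freq='h'))
        for start, end in zip(out['checkin_datetime'], out['checkout_datetime'])
    ]
    # One row per (stay, hour) pair.
    out = out.explode('Checked_in_hour').reset_index(drop=True)
    out['Checked_in_hour'] = pd.to_datetime(out['Checked_in_hour'])
    hour_end = out['Checked_in_hour'] + pd.Timedelta(hours=1)
    # Clip the stay to the hour: later of the two starts, earlier of the two ends.
    start = out['checkin_datetime'].where(
        out['checkin_datetime'] > out['Checked_in_hour'], out['Checked_in_hour'])
    end = out['checkout_datetime'].where(
        out['checkout_datetime'] < hour_end, hour_end)
    out['checked_in_minutes'] = ((end - start).dt.total_seconds() // 60).astype(int)
    # Drop empty pieces (a stay ending exactly on the hour), then aggregate.
    out = out[out['checked_in_minutes'] > 0]
    return (out.groupby(['Checked_in_hour', 'ID', 'Zone'], as_index=False)
               ['checked_in_minutes'].sum())

result = hourly_minutes(df)
print(result)
```

Grouping on Checked_in_hour, ID and Zone together means that repeated stays by the same ID in the same Zone and hour are added up rather than listed separately.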