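I am working with product unavailability data that records, at hourly resolution, the start and end of each period during which a product was unavailable. Here is an example: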
import pandas as pd
import datetime as dt

# One row per unavailability period: start, end, product
unavability = pd.DataFrame(
    [[dt.datetime(2017, 10, 19, 11), dt.datetime(2017, 10, 19, 12), 'broom'],
     [dt.datetime(2017, 10, 19, 9), dt.datetime(2017, 10, 19, 10), 'broom'],
     [dt.datetime(2017, 10, 19, 1), dt.datetime(2017, 10, 19, 2), 'bike'],
     [dt.datetime(2017, 10, 19, 22), dt.datetime(2017, 10, 20, 3), 'bike']],
    columns=['start_date', 'end_date', 'product'])
print(unavability)
           start_date            end_date product
0 2017-10-19 11:00:00 2017-10-19 12:00:00   broom
1 2017-10-19 09:00:00 2017-10-19 10:00:00   broom
2 2017-10-19 01:00:00 2017-10-19 02:00:00    bike
3 2017-10-19 22:00:00 2017-10-20 03:00:00    bike
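I want to group the data into the proportion of availability per date and product, so I would like to transform the DataFrame above into the following. Keep in mind that it should also work when an unavailability period exceeds 49 hours (spanning 3 days).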
desired = pd.DataFrame(
    [[dt.datetime(2017, 10, 19), 'broom', 22/24.0],  # 2 hours of unavailability
     [dt.datetime(2017, 10, 20), 'broom', 24/24.0],  # product fully available that day
     [dt.datetime(2017, 10, 19), 'bike', 22/24.0],   # 2 hours of unavailability, from 22 to 24
     [dt.datetime(2017, 10, 20), 'bike', 21/24.0]],  # 3 hours of unavailability, from 00 to 03
    columns=['date', 'product', 'avalability_proportion'])
print(desired)
        date product  avalability_proportion
0 2017-10-19   broom                0.916667
1 2017-10-20   broom                1.000000
2 2017-10-19    bike                0.916667
3 2017-10-20    bike                0.875000
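Some difficulties: I thought of creating a transformation that generates the theoretical hours for every product, as in "Fill missing timeseries data using pandas or numpy", then joining it with the original data and somehow filling it in, but I am not sure that is a smart approach.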
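For reference, the hourly-grid idea described above could be sketched roughly as follows. This is only a sketch under stated assumptions: every period starts and ends on a whole hour, and the names `one_hour`, `hours`, `grid` and `result` are made up for illustration. Each period is expanded into the hour slots it covers, the slots are counted per product and date, and the counts are reindexed onto a full product × date grid so that fully available days show up as 1.0.

```python
import datetime as dt
import pandas as pd

unavability = pd.DataFrame(
    [[dt.datetime(2017, 10, 19, 11), dt.datetime(2017, 10, 19, 12), 'broom'],
     [dt.datetime(2017, 10, 19, 9), dt.datetime(2017, 10, 19, 10), 'broom'],
     [dt.datetime(2017, 10, 19, 1), dt.datetime(2017, 10, 19, 2), 'bike'],
     [dt.datetime(2017, 10, 19, 22), dt.datetime(2017, 10, 20, 3), 'bike']],
    columns=['start_date', 'end_date', 'product'])

one_hour = pd.Timedelta(hours=1)

# One (product, hour) pair per unavailable hour slot.
# Assumption: periods start and end on whole hours; the end hour is excluded.
rows = [
    (product, hour)
    for start, end, product in unavability.itertuples(index=False)
    for hour in pd.date_range(start, end - one_hour, freq=one_hour)
]
# drop_duplicates so that overlapping periods are not counted twice
hours = pd.DataFrame(rows, columns=['product', 'hour']).drop_duplicates()
hours['date'] = hours['hour'].dt.normalize()

# Number of unavailable hours per product and date
unavailable = hours.groupby(['product', 'date']).size()

# Full grid: every product on every date between the first and last date seen
all_dates = pd.date_range(hours['date'].min(), hours['date'].max(), freq='D')
grid = pd.MultiIndex.from_product(
    [unavability['product'].unique(), all_dates], names=['product', 'date'])

# Dates without any unavailable hour get 0, i.e. a proportion of 1.0
availability = 1 - unavailable.reindex(grid, fill_value=0) / 24.0
result = availability.rename('avalability_proportion').reset_index()
print(result)
```

Note that on the sample data this counts three unavailable hours for the bike on 2017-10-19 (01:00 to 02:00 plus 22:00 to 24:00), so that row comes out as 21/24 rather than the 22/24 written in the desired table. Because every period is expanded hour by hour, a period spanning three or more dates needs no special handling.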
Any help with this would be great, thanks in advance!
Answer 0 (score: 1)
My naive solution, hope this helps:
df = unavability.copy()
# Flag rows whose period crosses midnight and remember the split point
# (midnight at the start of the end date).
df['is_date_changed'] = df.start_date.dt.date != df.end_date.dt.date
df.loc[df.is_date_changed, 'intermediate_date'] = pd.to_datetime(df.end_date.dt.date)
df_date_is_changed = df.loc[df.is_date_changed]
# .copy() so the rows appended below do not write into a slice of df
df_date_not_changed = df.loc[~df.is_date_changed].copy()
# Expand every flagged row into two rows and append them to
# df_date_not_changed. For example,
#   2017-10-19 22:00:00   2017-10-20 03:00:00
# is expanded into:
#   2017-10-19 22:00:00   2017-10-20 00:00:00
#   2017-10-20 00:00:00   2017-10-20 03:00:00
# Note: the split happens only at the midnight before end_date, so a
# period spanning three or more dates is not handled correctly.
for idx, row in df_date_is_changed.iterrows():
    row1 = [row['start_date'], row['intermediate_date'], row['product'], None, None]
    df_date_not_changed.loc[-1] = row1
    df_date_not_changed.index = df_date_not_changed.index + 1
    row2 = [row['intermediate_date'], row['end_date'], row['product'], None, None]
    df_date_not_changed.loc[-1] = row2
    df_date_not_changed.index = df_date_not_changed.index + 1
df = df_date_not_changed
# Each row now lies within a single date; take it from the earlier endpoint
df['date'] = df.apply(
    lambda x: min(x['start_date'], x['end_date']),
    axis=1)
df['date'] = df['date'].dt.date
df['time_delta'] = df.end_date - df.start_date
# Total unavailable time per product and date
print(df.groupby(['product', 'date']).agg({'time_delta': 'sum'}))
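The code above yields the summed unavailable time and splits a period at one midnight only. A possible generalization, offered as a rough sketch rather than a definitive implementation, is to walk over every calendar day a period touches and record the overlap with that day, then convert the totals into an availability proportion. The function name `daily_availability` and the intermediate column `unavailable` are hypothetical, and the sketch assumes that periods of the same product do not overlap each other (overlaps would be counted twice).

```python
import datetime as dt
import pandas as pd


def daily_availability(periods):
    """Availability proportion per product and calendar date.

    Assumes columns 'start_date', 'end_date', 'product' and that periods
    of the same product do not overlap.
    """
    one_day = pd.Timedelta(days=1)
    pieces = []
    cols = ['start_date', 'end_date', 'product']
    for start, end, product in periods[cols].itertuples(index=False):
        day = start.normalize()  # midnight at the start of the first date
        while day < end:
            # Overlap of [start, end) with the calendar day [day, day + 1 day)
            overlap = min(end, day + one_day) - max(start, day)
            pieces.append((product, day, overlap))
            day += one_day
    pieces = pd.DataFrame(pieces, columns=['product', 'date', 'unavailable'])
    unavailable = pieces.groupby(['product', 'date'])['unavailable'].sum()

    # Every product on every date in the covered range, so that fully
    # available days appear with a proportion of 1.0
    all_dates = pd.date_range(pieces['date'].min(), pieces['date'].max(), freq='D')
    grid = pd.MultiIndex.from_product(
        [periods['product'].unique(), all_dates], names=['product', 'date'])
    unavailable = unavailable.reindex(grid, fill_value=pd.Timedelta(0))

    proportion = 1 - unavailable / one_day
    return proportion.rename('avalability_proportion').reset_index()


unavability = pd.DataFrame(
    [[dt.datetime(2017, 10, 19, 11), dt.datetime(2017, 10, 19, 12), 'broom'],
     [dt.datetime(2017, 10, 19, 9), dt.datetime(2017, 10, 19, 10), 'broom'],
     [dt.datetime(2017, 10, 19, 1), dt.datetime(2017, 10, 19, 2), 'bike'],
     [dt.datetime(2017, 10, 19, 22), dt.datetime(2017, 10, 20, 3), 'bike']],
    columns=['start_date', 'end_date', 'product'])
result = daily_availability(unavability)
print(result)
```

Because the loop advances one day at a time, a 49-hour period is spread over the three dates it touches, and the start and end times do not need to fall on whole hours.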