Group hourly data marked by start & end into daily proportion data using a pandas DataFrame

Time: 2017-11-01 10:21:17

Tags: pandas

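I am working with hourly unavailability data for products, where each row marks the start and end of a period during which a product was unavailable. Here is an example: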

import pandas as pd
import datetime as dt
unavability = pd.DataFrame([[dt.datetime(2017, 10, 19,11), dt.datetime(2017, 10, 19,12),'broom'],
                       [dt.datetime(2017, 10, 19,9),dt.datetime(2017, 10, 19,10),'broom'], 
                       [dt.datetime(2017, 10, 19,1), dt.datetime(2017, 10, 19,2),'bike'],
                       [dt.datetime(2017, 10, 19,22),dt.datetime(2017, 10, 20,3),'bike']],
                      columns=['start_date', 'end_date','product'])
print(unavability)
      start_date            end_date product
0 2017-10-19 11:00:00 2017-10-19 12:00:00   broom
1 2017-10-19 09:00:00 2017-10-19 10:00:00   broom
2 2017-10-19 01:00:00 2017-10-19 02:00:00    bike
3 2017-10-19 22:00:00 2017-10-20 03:00:00    bike

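I want to group the data into the proportion of each day that each product was available, so I would like to transform the DataFrame above into the following. Keep in mind that it should also work when an unavailability period is longer than 49 hours (i.e. it spans 3 calendar days).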

desired=pd.DataFrame([[dt.datetime(2017, 10, 19),'broom',22/24.0], # 2 hours of unavailability
              [dt.datetime(2017, 10, 20),'broom',24/24.0], # product fully available that day
              [dt.datetime(2017, 10, 19),'bike',22/24.0], # 2 hours of unavailability, from 22 to 24
              [dt.datetime(2017, 10, 20),'bike',21/24.0]], # 3 hours of unavailability, from 00 to 3
              columns=['date', 'product','avalability_proportion'])
print(desired)
        date product  avalability_proportion
0 2017-10-19   broom                0.916667
1 2017-10-20   broom                1.000000
2 2017-10-19    bike                0.916667
3 2017-10-20    bike                0.875000

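Where I am stuck: I thought about building a transformation that creates the theoretical hourly rows for every available product, as in "Fill missing timeseries data using pandas or numpy", then joining the original data onto it and filling it in somehow, but I am not sure that is a smart way to do it.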
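For reference, the hourly-grid idea described above might look roughly like the sketch below. It is only an illustration under some assumptions: every period starts and ends on a whole hour, there is at least one period of non-zero length, and the names `sample`, `ONE_HOUR`, `availability_from_hourly_grid` and `availability_proportion` are made up for this example.

```python
import pandas as pd

ONE_HOUR = pd.Timedelta(hours=1)

# Sample data from the question.
sample = pd.DataFrame(
    [
        [pd.Timestamp("2017-10-19 11:00"), pd.Timestamp("2017-10-19 12:00"), "broom"],
        [pd.Timestamp("2017-10-19 09:00"), pd.Timestamp("2017-10-19 10:00"), "broom"],
        [pd.Timestamp("2017-10-19 01:00"), pd.Timestamp("2017-10-19 02:00"), "bike"],
        [pd.Timestamp("2017-10-19 22:00"), pd.Timestamp("2017-10-20 03:00"), "bike"],
    ],
    columns=["start_date", "end_date", "product"],
)


def availability_from_hourly_grid(periods):
    """Hypothetical helper: daily availability via a full product x hour grid."""
    # Theoretical grid: every product for every hour of every day in the data.
    first_hour = periods["start_date"].min().normalize()
    last_hour = periods["end_date"].max().normalize() + 23 * ONE_HOUR
    grid = pd.MultiIndex.from_product(
        [
            sorted(periods["product"].unique()),
            pd.date_range(first_hour, last_hour, freq=ONE_HOUR),
        ],
        names=["product", "hour"],
    ).to_frame(index=False)

    # One row per unavailable hour; the hour starting at end_date is excluded.
    blocked = pd.DataFrame(
        [
            (row.product, hour)
            for row in periods.itertuples(index=False)
            for hour in pd.date_range(row.start_date, row.end_date, freq=ONE_HOUR)[:-1]
        ],
        columns=["product", "hour"],
    ).drop_duplicates()  # overlapping periods must not count an hour twice
    blocked["unavailable"] = True

    # Join the periods onto the grid; hours without a match were available.
    merged = grid.merge(blocked, on=["product", "hour"], how="left")
    merged["available"] = merged["unavailable"].isna()
    merged["date"] = merged["hour"].dt.normalize()
    return (
        merged.groupby(["product", "date"], as_index=False)["available"]
        .mean()
        .rename(columns={"available": "availability_proportion"})
    )


result = availability_from_hourly_grid(sample)
print(result)
```

The grid grows with products times hours, so this is probably only reasonable for modest date ranges, and periods that do not fall on whole hours would be undercounted.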

Any help with this would be great. Thanks in advance!

1 answer:

Answer 0 (score: 1)

My naive solution; I hope it helps:

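# work on a copy so the helper columns are not added to the original frame
df = unavability.copy()
# if the date changes between start and end, remember those rows
df['is_date_changed'] = df.start_date.dt.date != df.end_date.dt.date
# midnight at the start of the end date is where such a row gets split
df.loc[df.is_date_changed,'intermediate_date'] = pd.to_datetime(df.end_date.dt.date)
df_date_is_changed = df.loc[df.is_date_changed]
# .copy() so that appending rows below does not write into a slice of df
df_date_not_changed = df.loc[~df.is_date_changed].copy()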

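# expand every changed row into two,
# and append those rows to the date_not_changed dataframe.
# for example,
# 2017-10-19 22:00:00   2017-10-20 03:00:00
# will be expanded into two rows:
# 2017-10-19 22:00:00   2017-10-20 00:00:00
# 2017-10-20 00:00:00   2017-10-20 03:00:00
# note: a period that crosses more than one midnight is still split only
# at the midnight of its end date, so its first piece is longer than a day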
for idx,row in df_date_is_changed.iterrows():
    row1 = [row['start_date'],row['intermediate_date'],row['product'],None,None]
    df_date_not_changed.loc[-1] = row1
    df_date_not_changed.index = df_date_not_changed.index + 1
    row2 = [row['intermediate_date'],row['end_date'],row['product'],None,None]
    df_date_not_changed.loc[-1] = row2
    df_date_not_changed.index = df_date_not_changed.index + 1

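df = df_date_not_changed
# the earlier timestamp of each piece gives the calendar day it belongs to
df['date'] = df.apply(
    lambda x: min(x['start_date'], x['end_date']),
    axis=1)
df['date'] = df['date'].dt.date
df['time_delta'] = df.end_date - df.start_date

# total unavailable time per product and day
print(df.groupby(['product','date']).agg({'time_delta':'sum'}))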
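The result above is the summed unavailable time per product and day, not yet a proportion, and the split handles only one midnight per period. A possible variant that clips each period to every calendar day it touches and then converts to an availability proportion is sketched below. This is not the code from the answer; `daily_availability`, `sample`, `long_outage` and the column name `availability_proportion` are made up for the example, and it assumes at least one period of non-zero length and no overlapping periods for the same product.

```python
import pandas as pd

ONE_DAY = pd.Timedelta(days=1)


def daily_availability(periods):
    """Hypothetical helper: availability proportion per product and day."""
    pieces = []
    for row in periods.itertuples(index=False):
        # Every calendar day the period touches, however many there are.
        days = pd.date_range(
            row.start_date.normalize(), row.end_date.normalize(), freq="D"
        )
        for day in days:
            # Clip the period to the window [day, day + 1 day).
            clipped_start = max(row.start_date, day)
            clipped_end = min(row.end_date, day + ONE_DAY)
            if clipped_end > clipped_start:
                pieces.append((row.product, day, clipped_end - clipped_start))

    unavailable = (
        pd.DataFrame(pieces, columns=["product", "date", "unavailable"])
        .groupby(["product", "date"])["unavailable"]
        .sum()
    )
    # Add product/day combinations with no unavailability at all.
    all_days = pd.date_range(
        periods["start_date"].min().normalize(),
        periods["end_date"].max().normalize(),
        freq="D",
    )
    grid = pd.MultiIndex.from_product(
        [sorted(periods["product"].unique()), all_days], names=["product", "date"]
    )
    unavailable = unavailable.reindex(grid, fill_value=pd.Timedelta(0))
    proportion = 1 - unavailable / ONE_DAY
    return proportion.rename("availability_proportion").reset_index()


# Sample data from the question.
sample = pd.DataFrame(
    [
        [pd.Timestamp("2017-10-19 11:00"), pd.Timestamp("2017-10-19 12:00"), "broom"],
        [pd.Timestamp("2017-10-19 09:00"), pd.Timestamp("2017-10-19 10:00"), "broom"],
        [pd.Timestamp("2017-10-19 01:00"), pd.Timestamp("2017-10-19 02:00"), "bike"],
        [pd.Timestamp("2017-10-19 22:00"), pd.Timestamp("2017-10-20 03:00"), "bike"],
    ],
    columns=["start_date", "end_date", "product"],
)
result = daily_availability(sample)
print(result)

# A 49-hour period that spans three calendar days.
long_outage = pd.DataFrame(
    [[pd.Timestamp("2017-10-19 22:00"), pd.Timestamp("2017-10-21 23:00"), "bike"]],
    columns=["start_date", "end_date", "product"],
)
long_result = daily_availability(long_outage)
print(long_result)
```

Note that on the sample data this counts three unavailable hours for bike on 2017-10-19 (01:00 to 02:00 plus 22:00 to 24:00), i.e. 21/24, whereas the desired table in the question lists 22/24 for that row.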