Recreating an inverse dataframe from existing data

Time: 2020-08-17 06:51:31

Tags: python pandas datetime

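I have a dataframe that tells me how many signals were triggered (for example, a count value per hour). Is there a way to get the inverse, that is, to also record the hours in which nothing was triggered, with a count of zero for every hour that has no signal?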

For example:

In [32]: datum.head()                                                                                                                                                                                                                                
Out[32]: 
     item_name    name           date_time  pred_value
476      alpha  model1 2019-12-01 06:00:00           2
477      alpha  model1 2019-12-01 07:00:00           2
478      alpha  model2 2019-12-01 08:00:00           2
479       beta  model3 2019-12-01 09:00:00           2
480       beta  model1 2019-12-01 10:00:00           2

In the example above, we can see that alpha has data counts for the 6th/7th hours of 2019-12-01 but no data after that, and similarly beta for the 9th/10th hours. For the remaining hours, I need to fill the dataframe with zeros.

I need to recreate a new dataframe like the one below:

    item_name    name           date_time  pred_value
0       alpha  model1 2019-12-01 00:00:00           0
1       alpha  model1 2019-12-01 01:00:00           0
2       alpha  model1 2019-12-01 02:00:00           0
3       alpha  model1 2019-12-01 03:00:00           0
4       alpha  model1 2019-12-01 04:00:00           0
5       alpha  model1 2019-12-01 05:00:00           0
6       alpha  model1 2019-12-01 06:00:00           2
7       alpha  model1 2019-12-01 07:00:00           2
...
23      alpha  model1 2019-12-01 23:00:00           0
24      alpha  model1 2019-12-02 00:00:00           0
.
.
478     alpha  model2 2019-12-01 00:00:00           0
478     alpha  model2 2019-12-01 01:00:00           0
478     alpha  model2 2019-12-01 02:00:00           0
478     alpha  model2 2019-12-01 03:00:00           0

Like this, we have multiple name values and multiple item_name values (alpha/beta/...).

1 answer:

Answer 0 (score: 1)

Use DataFrame.reindex to add all missing combinations of item_name, name and datetime values:

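import pandas as pd

df['date_time'] = pd.to_datetime(df['date_time'])

# every hour from midnight of the first day to 23:00 of the last day
# (lowercase 'h' is used because the uppercase 'H' alias is deprecated in recent pandas)
dates = pd.date_range(df['date_time'].min().floor('D'),
                      df['date_time'].max().floor('D') + pd.Timedelta(hours=23),
                      freq='h')
# all item_name x name x hour combinations
mux = pd.MultiIndex.from_product([df['item_name'].unique(),
                                  df['name'].unique(),
                                  dates], names=['item_name','name','date_time'])
# combinations missing from the data get pred_value 0
df = df.set_index(['item_name','name','date_time']).reindex(mux, fill_value=0).reset_index()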
print (df)
    item_name    name           date_time  pred_value
0       alpha  model1 2019-12-01 00:00:00           0
1       alpha  model1 2019-12-01 01:00:00           0
2       alpha  model1 2019-12-01 02:00:00           0
3       alpha  model1 2019-12-01 03:00:00           0
4       alpha  model1 2019-12-01 04:00:00           0
..        ...     ...                 ...         ...
139      beta  model3 2019-12-01 19:00:00           0
140      beta  model3 2019-12-01 20:00:00           0
141      beta  model3 2019-12-01 21:00:00           0
142      beta  model3 2019-12-01 22:00:00           0
143      beta  model3 2019-12-01 23:00:00           0
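
The snippet above can be tried end to end with the five sample rows from the question. This is only a sketch: the frame is rebuilt by hand, and the names full_index and filled are illustrative, not part of the original answer. It checks that the result holds one row per item_name x name x hour and that the original counts survive.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "item_name": ["alpha", "alpha", "alpha", "beta", "beta"],
    "name": ["model1", "model1", "model2", "model3", "model1"],
    "date_time": pd.to_datetime([
        "2019-12-01 06:00:00", "2019-12-01 07:00:00", "2019-12-01 08:00:00",
        "2019-12-01 09:00:00", "2019-12-01 10:00:00",
    ]),
    "pred_value": [2, 2, 2, 2, 2],
})

# Every hour from midnight of the first day to 23:00 of the last day.
start = df["date_time"].min().floor("D")
end = df["date_time"].max().floor("D") + pd.Timedelta(hours=23)
dates = pd.date_range(start, end, freq="h")

# Cartesian product of all items, all names and all hours.
full_index = pd.MultiIndex.from_product(
    [df["item_name"].unique(), df["name"].unique(), dates],
    names=["item_name", "name", "date_time"],
)

# Combinations that are absent from the data get pred_value 0.
filled = (
    df.set_index(["item_name", "name", "date_time"])
      .reindex(full_index, fill_value=0)
      .reset_index()
)

print(len(filled))                 # 2 items * 3 names * 24 hours = 144
print(filled["pred_value"].sum())  # 10, the original counts are preserved
```

Note that reindex raises an error if the same (item_name, name, date_time) combination appears more than once, so duplicated rows would have to be aggregated first.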

Another idea, if you need to add the missing datetimes only for each existing combination of item_name and name:

df['date_time'] = pd.to_datetime(df['date_time'])

# one shared hourly range covering all days in the data
dates = pd.date_range(df['date_time'].min().floor('D'),
                      df['date_time'].max().floor('D') + pd.Timedelta(hours=23),
                      freq='h', name='date_time')
# reindex each existing (item_name, name) group to the full range
df2 = (df.set_index('date_time')
        .groupby(['item_name','name'])['pred_value']
        .apply(lambda x: x.reindex(dates, fill_value=0))
        .reset_index())
print (df2)
   item_name    name           date_time  pred_value
0      alpha  model1 2019-12-01 00:00:00           0
1      alpha  model1 2019-12-01 01:00:00           0
2      alpha  model1 2019-12-01 02:00:00           0
3      alpha  model1 2019-12-01 03:00:00           0
4      alpha  model1 2019-12-01 04:00:00           0
..       ...     ...                 ...         ...
91      beta  model3 2019-12-01 19:00:00           0
92      beta  model3 2019-12-01 20:00:00           0
93      beta  model3 2019-12-01 21:00:00           0
94      beta  model3 2019-12-01 22:00:00           0
95      beta  model3 2019-12-01 23:00:00           0

[96 rows x 4 columns]
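
To make the difference from the first approach concrete, here is a small sketch on the question's sample rows (rebuilt by hand; the names per_pair, observed_pairs and result_pairs are illustrative). The group-wise reindex only fills hours for pairs that actually occur, so a pair such as (beta, model2), which the Cartesian product would create, does not appear.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "item_name": ["alpha", "alpha", "alpha", "beta", "beta"],
    "name": ["model1", "model1", "model2", "model3", "model1"],
    "date_time": pd.to_datetime([
        "2019-12-01 06:00:00", "2019-12-01 07:00:00", "2019-12-01 08:00:00",
        "2019-12-01 09:00:00", "2019-12-01 10:00:00",
    ]),
    "pred_value": [2, 2, 2, 2, 2],
})

# All 24 hours of the single day present in the sample.
dates = pd.date_range("2019-12-01 00:00", "2019-12-01 23:00",
                      freq="h", name="date_time")

# Reindex each existing (item_name, name) group to the full hourly range.
per_pair = (
    df.set_index("date_time")
      .groupby(["item_name", "name"])["pred_value"]
      .apply(lambda s: s.reindex(dates, fill_value=0))
      .reset_index()
)

observed_pairs = set(zip(df["item_name"], df["name"]))
result_pairs = set(zip(per_pair["item_name"], per_pair["name"]))

print(result_pairs == observed_pairs)       # True: no new pairs are invented
print(("beta", "model2") in result_pairs)   # False
print(len(per_pair))                        # 4 pairs * 24 hours = 96
```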

If the datetime range is different for each combination of the first two columns, use:

df['date_time'] = pd.to_datetime(df['date_time'])

def f(x):
    # hourly range covering only the days present in this group
    dates = pd.date_range(x.index.min().floor('D'),
                          x.index.max().floor('D') + pd.Timedelta(hours=23),
                          freq='h', name='date_time')
    return x.reindex(dates, fill_value=0)
df3 = (df.set_index('date_time')
        .groupby(['item_name','name'])['pred_value']
        .apply(f)
        .reset_index())
print (df3)
   item_name    name           date_time  pred_value
0      alpha  model1 2019-12-01 00:00:00           0
1      alpha  model1 2019-12-01 01:00:00           0
2      alpha  model1 2019-12-01 02:00:00           0
3      alpha  model1 2019-12-01 03:00:00           0
4      alpha  model1 2019-12-01 04:00:00           0
..       ...     ...                 ...         ...
91      beta  model3 2019-12-01 19:00:00           0
92      beta  model3 2019-12-01 20:00:00           0
93      beta  model3 2019-12-01 21:00:00           0
94      beta  model3 2019-12-01 22:00:00           0
95      beta  model3 2019-12-01 23:00:00           0

[96 rows x 4 columns]
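
On the question's sample every row falls on the same day, so this variant gives the same 96 rows as the previous one. The sketch below uses hypothetical data (not from the question) in which one pair spans two days, to show that each group is padded only to its own range. The names fill_hours and rows_per_pair are illustrative.

```python
import pandas as pd

# Hypothetical data: (alpha, model1) spans two days, (beta, model3) only one.
df = pd.DataFrame({
    "item_name": ["alpha", "alpha", "beta"],
    "name": ["model1", "model1", "model3"],
    "date_time": pd.to_datetime([
        "2019-12-01 06:00:00", "2019-12-02 03:00:00", "2019-12-01 09:00:00",
    ]),
    "pred_value": [2, 2, 2],
})

def fill_hours(s):
    """Reindex one group's series to every hour of the days it covers."""
    hours = pd.date_range(
        s.index.min().floor("D"),
        s.index.max().floor("D") + pd.Timedelta(hours=23),
        freq="h",
        name="date_time",
    )
    return s.reindex(hours, fill_value=0)

filled = (
    df.set_index("date_time")
      .groupby(["item_name", "name"])["pred_value"]
      .apply(fill_hours)
      .reset_index()
)

# Number of hourly rows produced for each pair.
rows_per_pair = filled.groupby(["item_name", "name"]).size()
print(rows_per_pair.loc[("alpha", "model1")])  # 48: two full days
print(rows_per_pair.loc[("beta", "model3")])   # 24: one full day
```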