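I have a dataframe that tells me how many signals were triggered (for example, a count per hour). Is there a way to get the inverse, i.e. the hours in which nothing was triggered, recorded as zero for each hour without a signal?
For example: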
In [32]: datum.head()
Out[32]:
item_name name date_time pred_value
476 alpha model1 2019-12-01 06:00:00 2
477 alpha model1 2019-12-01 07:00:00 2
478 alpha model2 2019-12-01 08:00:00 2
479 beta model3 2019-12-01 09:00:00 2
480 beta model1 2019-12-01 10:00:00 2
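In the example above, alpha has counts for the 6th/7th hours of 2019-12-01 but no data after that; similarly, beta only has the 9th/10th hours.
For the remaining hours, I need to fill the dataframe with zeros.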
I need to build a new dataframe like the one below:
item_name name date_time pred_value
0 alpha model1 2019-12-01 00:00:00 0
1 alpha model1 2019-12-01 01:00:00 0
2 alpha model1 2019-12-01 02:00:00 0
3 alpha model1 2019-12-01 03:00:00 0
4 alpha model1 2019-12-01 04:00:00 0
5 alpha model1 2019-12-01 05:00:00 0
6 alpha model1 2019-12-01 06:00:00 2
7 alpha model1 2019-12-01 07:00:00 2
...
23 alpha model1 2019-12-01 23:00:00 0
24 alpha model1 2019-12-02 00:00:00 0
.
.
478 alpha model2 2019-12-01 00:00:00 0
478 alpha model2 2019-12-01 01:00:00 0
478 alpha model2 2019-12-01 02:00:00 0
478 alpha model2 2019-12-01 03:00:00 0
There are multiple name values (model1/model2/...) and multiple item_name values (alpha/beta/...).
Answer 0 (score: 1)
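Use DataFrame.reindex to add every missing combination of item_name, name, and datetime:
import pandas as pd

df['date_time'] = pd.to_datetime(df['date_time'])
# Hourly range from midnight of the first day to 23:00 of the last day.
# Lowercase 'h' is used because the 'H' alias is deprecated since pandas 2.2.
dates = pd.date_range(df['date_time'].min().floor('d'),
                      df['date_time'].max().floor('d') + pd.Timedelta(hours=23),
                      freq='h')
# Cartesian product of all item_name values, all name values, and all hours.
mux = pd.MultiIndex.from_product([df['item_name'].unique(),
                                  df['name'].unique(),
                                  dates], names=['item_name','name','date_time'])
# Rows absent from the original data get pred_value 0.
df = df.set_index(['item_name','name','date_time']).reindex(mux, fill_value=0).reset_index()
print(df)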
item_name name date_time pred_value
0 alpha model1 2019-12-01 00:00:00 0
1 alpha model1 2019-12-01 01:00:00 0
2 alpha model1 2019-12-01 02:00:00 0
3 alpha model1 2019-12-01 03:00:00 0
4 alpha model1 2019-12-01 04:00:00 0
.. ... ... ... ...
139 beta model3 2019-12-01 19:00:00 0
140 beta model3 2019-12-01 20:00:00 0
141 beta model3 2019-12-01 21:00:00 0
142 beta model3 2019-12-01 22:00:00 0
143 beta model3 2019-12-01 23:00:00 0
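To make the full-grid approach easy to try, here is a minimal self-contained sketch. The sample frame is rebuilt from the question's datum.head() output, and fill_full_grid is a hypothetical helper name, not part of pandas. It uses Timestamp.normalize() in place of floor('d') to get midnight.

```python
import pandas as pd

# Sample frame rebuilt from the question's datum.head() output.
df = pd.DataFrame({
    "item_name": ["alpha", "alpha", "alpha", "beta", "beta"],
    "name": ["model1", "model1", "model2", "model3", "model1"],
    "date_time": pd.to_datetime([
        "2019-12-01 06:00:00", "2019-12-01 07:00:00", "2019-12-01 08:00:00",
        "2019-12-01 09:00:00", "2019-12-01 10:00:00",
    ]),
    "pred_value": [2, 2, 2, 2, 2],
})


def fill_full_grid(frame):
    """Zero-fill every item_name x name x hour combination (hypothetical helper)."""
    start = frame["date_time"].min().normalize()            # midnight of first day
    end = frame["date_time"].max().normalize() + pd.Timedelta(hours=23)
    hours = pd.date_range(start, end, freq="h")
    grid = pd.MultiIndex.from_product(
        [frame["item_name"].unique(), frame["name"].unique(), hours],
        names=["item_name", "name", "date_time"],
    )
    return (frame.set_index(["item_name", "name", "date_time"])
                 .reindex(grid, fill_value=0)
                 .reset_index())


full = fill_full_grid(df)
print(len(full))                 # 144 = 2 items * 3 names * 24 hours
print(full["pred_value"].sum())  # 10, the original counts are preserved
```

Note that this grid also contains pairs that never occur in the data, such as beta/model2, which is why it has 144 rows.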
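Another idea, if you only need to add the missing datetimes for each existing combination of item_name and name:
import pandas as pd

df['date_time'] = pd.to_datetime(df['date_time'])
# Lowercase 'h' is used because the 'H' alias is deprecated since pandas 2.2.
dates = pd.date_range(df['date_time'].min().floor('d'),
                      df['date_time'].max().floor('d') + pd.Timedelta(hours=23),
                      freq='h', name='date_time')
# Reindex each (item_name, name) group to the shared hourly range.
df2 = (df.set_index('date_time')
         .groupby(['item_name','name'])['pred_value']
         .apply(lambda x: x.reindex(dates, fill_value=0))
         .reset_index())
print(df2)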
item_name name date_time pred_value
0 alpha model1 2019-12-01 00:00:00 0
1 alpha model1 2019-12-01 01:00:00 0
2 alpha model1 2019-12-01 02:00:00 0
3 alpha model1 2019-12-01 03:00:00 0
4 alpha model1 2019-12-01 04:00:00 0
.. ... ... ... ...
91 beta model3 2019-12-01 19:00:00 0
92 beta model3 2019-12-01 20:00:00 0
93 beta model3 2019-12-01 21:00:00 0
94 beta model3 2019-12-01 22:00:00 0
95 beta model3 2019-12-01 23:00:00 0
[96 rows x 4 columns]
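The same result can also be reached without groupby, by cross-joining the observed (item_name, name) pairs with the hourly range and left-merging the counts back in. This is a different technique from the answer's reindex, shown only as a sketch; it assumes pandas 1.2 or later for how="cross", and the sample frame is rebuilt from the question.

```python
import pandas as pd

# Sample frame rebuilt from the question's datum.head() output.
df = pd.DataFrame({
    "item_name": ["alpha", "alpha", "alpha", "beta", "beta"],
    "name": ["model1", "model1", "model2", "model3", "model1"],
    "date_time": pd.to_datetime([
        "2019-12-01 06:00:00", "2019-12-01 07:00:00", "2019-12-01 08:00:00",
        "2019-12-01 09:00:00", "2019-12-01 10:00:00",
    ]),
    "pred_value": [2, 2, 2, 2, 2],
})

# One shared hourly range covering whole days.
start = df["date_time"].min().normalize()
end = df["date_time"].max().normalize() + pd.Timedelta(hours=23)
hours = pd.DataFrame({"date_time": pd.date_range(start, end, freq="h")})

# Only the pairs that actually occur in the data (4 here, not 2 * 3 = 6).
pairs = df[["item_name", "name"]].drop_duplicates()

# Every observed pair combined with every hour, then the known counts merged in.
grid = pairs.merge(hours, how="cross")
filled = grid.merge(df, on=["item_name", "name", "date_time"], how="left")

# Hours with no match come back as NaN; turn them into integer zeros.
filled["pred_value"] = filled["pred_value"].fillna(0).astype(int)

print(len(filled))  # 96 = 4 observed pairs * 24 hours
```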
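If the datetime range differs for each combination of the first two columns, use:
import pandas as pd

df['date_time'] = pd.to_datetime(df['date_time'])

def f(x):
    # Build the hourly range from this group's own first and last day.
    # Lowercase 'h' is used because the 'H' alias is deprecated since pandas 2.2.
    dates = pd.date_range(x.index.min().floor('d'),
                          x.index.max().floor('d') + pd.Timedelta(hours=23),
                          freq='h', name='date_time')
    return x.reindex(dates, fill_value=0)

df3 = (df.set_index('date_time')
         .groupby(['item_name','name'])['pred_value']
         .apply(f)
         .reset_index())
print(df3)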
item_name name date_time pred_value
0 alpha model1 2019-12-01 00:00:00 0
1 alpha model1 2019-12-01 01:00:00 0
2 alpha model1 2019-12-01 02:00:00 0
3 alpha model1 2019-12-01 03:00:00 0
4 alpha model1 2019-12-01 04:00:00 0
.. ... ... ... ...
91 beta model3 2019-12-01 19:00:00 0
92 beta model3 2019-12-01 20:00:00 0
93 beta model3 2019-12-01 21:00:00 0
94 beta model3 2019-12-01 22:00:00 0
95 beta model3 2019-12-01 23:00:00 0
[96 rows x 4 columns]
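With the question's sample every group falls on the same day, so the per-group variant yields the same 96 rows. The sketch below uses made-up data in which one group spans two days, to show where the per-group range makes a difference. It is written as an explicit loop with pd.concat instead of groupby.apply; the data and the helper name fill_hours_per_group are hypothetical.

```python
import pandas as pd

# Hypothetical data: alpha/model1 spans two days, beta/model3 only one.
df = pd.DataFrame({
    "item_name": ["alpha", "alpha", "beta"],
    "name": ["model1", "model1", "model3"],
    "date_time": pd.to_datetime([
        "2019-12-01 06:00:00", "2019-12-02 07:00:00", "2019-12-01 09:00:00",
    ]),
    "pred_value": [2, 3, 4],
})


def fill_hours_per_group(frame):
    """Zero-fill each (item_name, name) group over its own whole-day range."""
    parts = []
    for (item, name), grp in frame.groupby(["item_name", "name"]):
        s = grp.set_index("date_time")["pred_value"]
        hours = pd.date_range(
            s.index.min().normalize(),                            # first midnight
            s.index.max().normalize() + pd.Timedelta(hours=23),   # last 23:00
            freq="h", name="date_time",
        )
        filled = s.reindex(hours, fill_value=0).reset_index()
        # Put the group keys back in front as ordinary columns.
        filled.insert(0, "name", name)
        filled.insert(0, "item_name", item)
        parts.append(filled)
    return pd.concat(parts, ignore_index=True)


out = fill_hours_per_group(df)
sizes = out.groupby(["item_name", "name"]).size().to_dict()
print(sizes)  # alpha/model1 gets 48 hourly rows, beta/model3 gets 24
```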