Forward-filling seasonal data in pandas

Time: 2019-08-17 21:21:18

Tags: python pandas

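I have hourly observations of several variables that exhibit daily seasonality. I want to fill every missing value with the value the same variable had 24 hours earlier.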

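Ideally, my function would fill missing values from oldest to newest, so that if there are 25 consecutive missing values, the 25th is filled with the same value as the first. Using Series.map() fails in this case.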

                         value  desired_output
hour                                          
2019-08-17 00:00:00  58.712986       58.712986
2019-08-17 01:00:00  28.904234       28.904234
2019-08-17 02:00:00  14.275149       14.275149
2019-08-17 03:00:00  58.777087       58.777087
2019-08-17 04:00:00  95.964955       95.964955
2019-08-17 05:00:00  64.971372       64.971372
2019-08-17 06:00:00  95.759469       95.759469
2019-08-17 07:00:00  98.675457       98.675457
2019-08-17 08:00:00  77.510319       77.510319
2019-08-17 09:00:00  56.492446       56.492446
2019-08-17 10:00:00  90.968924       90.968924
2019-08-17 11:00:00  66.647501       66.647501
2019-08-17 12:00:00   7.756725        7.756725
2019-08-17 13:00:00  49.328135       49.328135
2019-08-17 14:00:00  28.634033       28.634033
2019-08-17 15:00:00  65.157161       65.157161
2019-08-17 16:00:00  93.127539       93.127539
2019-08-17 17:00:00  98.806335       98.806335
2019-08-17 18:00:00  94.789761       94.789761
2019-08-17 19:00:00  63.518037       63.518037
2019-08-17 20:00:00  89.524433       89.524433
2019-08-17 21:00:00  48.076081       48.076081
2019-08-17 22:00:00   5.027928        5.027928
2019-08-17 23:00:00   0.417763        0.417763
2019-08-18 00:00:00  29.933627       29.933627
2019-08-18 01:00:00  61.726948       61.726948
2019-08-18 02:00:00        NaN       14.275149
2019-08-18 03:00:00        NaN       58.777087
2019-08-18 04:00:00        NaN       95.964955
2019-08-18 05:00:00        NaN       64.971372
2019-08-18 06:00:00        NaN       95.759469
2019-08-18 07:00:00        NaN       98.675457
2019-08-18 08:00:00        NaN       77.510319
2019-08-18 09:00:00        NaN       56.492446
2019-08-18 10:00:00        NaN       90.968924
2019-08-18 11:00:00        NaN       66.647501
2019-08-18 12:00:00        NaN        7.756725
2019-08-18 13:00:00        NaN       49.328135
2019-08-18 14:00:00        NaN       28.634033
2019-08-18 15:00:00        NaN       65.157161
2019-08-18 16:00:00        NaN       93.127539
2019-08-18 17:00:00        NaN       98.806335
2019-08-18 18:00:00        NaN       94.789761
2019-08-18 19:00:00        NaN       63.518037
2019-08-18 20:00:00        NaN       89.524433
2019-08-18 21:00:00        NaN       48.076081
2019-08-18 22:00:00        NaN        5.027928
2019-08-18 23:00:00        NaN        0.417763
2019-08-19 00:00:00        NaN       29.933627
2019-08-19 01:00:00        NaN       61.726948
2019-08-19 02:00:00        NaN       14.275149
2019-08-19 03:00:00        NaN       58.777087
2019-08-19 04:00:00        NaN       95.964955
2019-08-19 05:00:00        NaN       64.971372
2019-08-19 06:00:00        NaN       95.759469
2019-08-19 07:00:00        NaN       98.675457
2019-08-19 08:00:00        NaN       77.510319
2019-08-19 09:00:00        NaN       56.492446
2019-08-19 10:00:00        NaN       90.968924
2019-08-19 11:00:00        NaN       66.647501
2019-08-19 12:00:00        NaN        7.756725
2019-08-19 13:00:00  61.457913       61.457913
2019-08-19 14:00:00  52.429383       52.429383
2019-08-19 15:00:00  79.016485       79.016485
2019-08-19 16:00:00  77.724758       77.724758
2019-08-19 17:00:00  62.205810       62.205810
2019-08-19 18:00:00  15.841707       15.841707
2019-08-19 19:00:00  72.196028       72.196028
2019-08-19 20:00:00   5.497441        5.497441
2019-08-19 21:00:00  30.737596       30.737596
2019-08-19 22:00:00  65.550690       65.550690
2019-08-19 23:00:00   3.543332        3.543332

import pandas as pd
from dateutil.relativedelta import relativedelta as rel_delta

df['isna'] = df['value'].isna()
df['value'] = df.index.map(lambda t: df.at[t - rel_delta(hours=24), 'value'] if df.at[t,'isna'] and t - rel_delta(hours=24) >= df.index.min() else df.at[t, 'value'])
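The failure on long gaps can be reproduced on a toy series. This is only an illustrative sketch: the series `s`, its values, and the 30-hour gap are made up, and `shift(24)` stands in for the single 24-hour lookup above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 72 hourly observations with a 30-hour gap.
idx = pd.date_range("2019-08-17", periods=72, freq=pd.Timedelta(hours=1))
s = pd.Series(np.arange(72, dtype=float), index=idx)
s.iloc[26:56] = np.nan  # positions 26..55 are missing

# Single lookup: take the value from exactly 24 rows (hours) earlier.
one_step = s.fillna(s.shift(24))

# Positions 50..55 look back at positions 26..31, which are missing too,
# so they stay NaN after a single pass.
still_missing = int(one_step.isna().sum())
print(still_missing)
```

Any gap longer than 24 hours leaves a tail unfilled, because the lookup reads from the original column rather than from the values that were just filled.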

What is the most efficient way to do this naive forward fill?

3 answers:

Answer 0 (score: 3)

IIUC, just group by the time of day and call ffill():

df['results'] = df.groupby(df.hour.dt.time).value.ffill()

If hour is your index, group by df.index.time instead.

Check:

>>> (df['results'] == df['desired_output']).all()
True
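For the case where hour is the index, a self-contained sketch might look like the following. The frame and its values are invented for illustration (the gap spans the same timestamps as in the question), and the approach assumes the timestamps are sorted in ascending order.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 72 hourly rows indexed by timestamp.
idx = pd.date_range("2019-08-17", periods=72, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"value": np.arange(72, dtype=float)}, index=idx)
df.iloc[26:61, 0] = np.nan  # 2019-08-18 02:00 through 2019-08-19 12:00

# Each time of day forms its own group; within a group, ffill carries the
# last observed value forward across any number of missing days.
df["results"] = df.groupby(df.index.time)["value"].ffill()

print(df.loc["2019-08-19 02:00:00", "results"])
```

Because the fill happens within each time-of-day group, a gap of several days is handled in one pass, with no repeated lookups.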

Answer 1 (score: 0)

Wouldn't this work?

df['value'] = df['value'].fillna(df.index.hour)

Answer 2 (score: 0)

Split the date and the time into two string columns. Call the result df.

         Date      Time      Value
0  2019-08-17  00:00:00  58.712986
1  2019-08-17  01:00:00  28.904234
2  2019-08-17  02:00:00  14.275149
3  2019-08-17  03:00:00  58.777087
4  2019-08-17  04:00:00  95.964955

Then reshape the data so that the Time values become the column headers, and forward-fill each hour's column down the dates.

(df reshaped)

Date        00:00:00   01:00:00   02:00:00   03:00:00   04:00:00
2019-08-17  58.712986  28.904234  14.275149  58.777087  95.964955
2019-08-18  29.933627  61.726948        NaN        NaN        NaN
2019-08-19        NaN        NaN        NaN        NaN        NaN

(df filled)

Date        00:00:00   01:00:00   02:00:00   03:00:00   04:00:00
2019-08-17  58.712986  28.904234  14.275149  58.777087  95.964955
2019-08-18  29.933627  61.726948  14.275149  58.777087  95.964955
2019-08-19  29.933627  61.726948  14.275149  58.777087  95.964955

(code)

(df.set_index(['Date','Time'])['Value']
   .unstack()
   .ffill()
   .stack()
   .reset_index(name='Value'))
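An end-to-end sketch of the reshape approach, starting from a timestamp range, could look like this. The data are invented, the intermediate names `wide`, `filled` and `out` are only for illustration, and the Date and Time strings are built with strftime as one possible way to split the timestamp.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 72 hourly rows, with Date and Time as string columns.
idx = pd.date_range("2019-08-17", periods=72, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({
    "Date": idx.strftime("%Y-%m-%d"),
    "Time": idx.strftime("%H:%M:%S"),
    "Value": np.arange(72, dtype=float),
})
df.loc[26:60, "Value"] = np.nan  # row labels 26..60 inclusive are missing

# One row per date, one column per time of day.
wide = df.set_index(["Date", "Time"])["Value"].unstack()

# Fill each time-of-day column downwards, i.e. from older to newer dates.
filled = wide.ffill()

# Back to long format, ordered by Date and then Time.
out = filled.stack().reset_index(name="Value")
print(out.iloc[26])
```

Depending on the pandas version, stack() may drop entries that are still NaN after the fill (for example, hours missing on the very first date), so the long result can have fewer rows than the input.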