My df looks like this:
df
datetime col_y col_x1 col_x2
2019-01-01 00:00:00 0 0 0
2019-01-01 01:00:00 0 0 0
2019-01-01 02:00:00 0 0 0
2019-01-01 03:00:00 0 0 0
2019-01-01 04:00:00 0 0 0
2019-01-01 05:00:00 0 0 0
2019-01-01 06:00:00 10 5 6
2019-01-01 07:00:00 15 10 9
2019-01-01 08:00:00 17 NAN NAN
2019-01-01 09:00:00 20 17 18
...
2019-01-01 15:00:00 120 40 50
2019-01-01 16:00:00 170 NAN NAN
2019-01-01 17:00:00 200 60 90
2019-01-01 18:00:00 150 30 45
2019-01-01 19:00:00 100 25 30
2019-01-01 20:00:00 90 NAN NAN
2019-01-01 21:00:00 70 NAN NAN
2019-01-01 22:00:00 60 10 10
2019-01-01 23:00:00 0 0 0
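Because each interval may contain a different proportion of col_y, I need to fill each missing value with the mean of the nearest non-missing values before and after it.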
x(t) = [x(t+1) + x(t-1)]/2
e.g.
col_x1 (10 + 17) /2 = 13.5
2019-01-01 08:00:00 17 NAN NAN
| |
\ /
\ /
2019-01-01 08:00:00 17 13.5 13.5
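A minimal sketch of the single-gap case in pandas, assuming datetime is the index and the NAN entries are real NaN values. The names `sample`, `cols`, and `filled` are only illustrative. Forward-fill gives the nearest value before each gap, backward-fill gives the nearest value after it, and their mean matches the formula above. Cells that are not missing keep their value, because both fills return the same number there.

```python
import numpy as np
import pandas as pd

# Rows 07:00-09:00 from the table above; np.nan marks the missing readings.
sample = pd.DataFrame(
    {
        "col_y": [15, 17, 20],
        "col_x1": [10, np.nan, 17],
        "col_x2": [9, np.nan, 18],
    },
    index=pd.to_datetime(
        ["2019-01-01 07:00:00", "2019-01-01 08:00:00", "2019-01-01 09:00:00"]
    ),
)

cols = ["col_x1", "col_x2"]  # only these columns have gaps

# Mean of the nearest non-missing value before (ffill) and after (bfill).
filled = sample.copy()
filled[cols] = (sample[cols].ffill() + sample[cols].bfill()) / 2

print(filled)
```

A gap at the very start or end of a column has no neighbour on one side, so this sketch would leave it as NaN.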
e.g.
2019-01-01 19:00:00 100 25 30
2019-01-01 20:00:00 90 NAN NAN
2019-01-01 21:00:00 70 NAN NAN
2019-01-01 22:00:00 60 10 10
| |
\ /
\ /
2019-01-01 19:00:00 100 25 30
2019-01-01 20:00:00 90 17.5 20
2019-01-01 21:00:00 70 17.5 20
2019-01-01 22:00:00 60 10 10
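For a run of consecutive gaps like this one, the same forward-fill/backward-fill mean gives every row in the run one shared value, whereas linear interpolation would ramp between the two neighbours. The sketch below contrasts the two on rows 19:00-22:00, again assuming the NAN entries are real NaN values; `block`, `neighbour_mean`, and `linear` are illustrative names.

```python
import numpy as np
import pandas as pd

# Rows 19:00-22:00 from the table above, with two consecutive gaps.
block = pd.DataFrame(
    {
        "col_x1": [25, np.nan, np.nan, 10],
        "col_x2": [30, np.nan, np.nan, 10],
    },
    index=pd.to_datetime(
        [
            "2019-01-01 19:00:00",
            "2019-01-01 20:00:00",
            "2019-01-01 21:00:00",
            "2019-01-01 22:00:00",
        ]
    ),
)

# Every NaN in the run gets the same value: the mean of the two
# non-missing values that bound the run.
neighbour_mean = (block.ffill() + block.bfill()) / 2

# Linear interpolation instead steps evenly from one neighbour to the other.
linear = block.interpolate(method="linear")

print(neighbour_mean)
print(linear)
```

In col_x2 the neighbour mean is 20 for both missing rows, while linear interpolation produces two different values between 30 and 10, so the two approaches are interchangeable only for gaps of length one.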
So I would like the final df to look like this:
df
datetime col_y col_x1 col_x2
2019-01-01 00:00:00 0 0 0
2019-01-01 01:00:00 0 0 0
2019-01-01 02:00:00 0 0 0
2019-01-01 03:00:00 0 0 0
2019-01-01 04:00:00 0 0 0
2019-01-01 05:00:00 0 0 0
2019-01-01 06:00:00 10 5 6
2019-01-01 07:00:00 15 10 9
2019-01-01 08:00:00 17 13.5 13.5
2019-01-01 09:00:00 20 17 18
...
2019-01-01 15:00:00 120 40 50
2019-01-01 16:00:00 170 50 70
2019-01-01 17:00:00 200 60 90
2019-01-01 18:00:00 150 30 45
2019-01-01 19:00:00 100 25 30
2019-01-01 20:00:00 90 17.5 20
2019-01-01 21:00:00 70 17.5 20
2019-01-01 22:00:00 60 10 10
2019-01-01 23:00:00 0 0 0
Is there a function that would make it easier to handle missing data in a case like this?