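How can I efficiently fill gaps in a DataFrame half forward and half backward?

Time: 2019-07-05 17:08:17

Tags: python pandas dataframe

I have a DataFrame with a datetime index. The index has some gaps, so I upsample it to a regular 1-second spacing (the example below uses 10 seconds to keep it short). I want to fill each gap half by forward filling (from the left side of the gap) and half by backward filling (from the right side of the gap).

Input: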

2000-01-01 00:00:00    0
2000-01-01 00:01:00    1
2000-01-01 00:02:00    2
2000-01-01 00:03:00    3
2000-01-01 00:04:10    4

Input upsampled to 10-second intervals:

2000-01-01 00:00:00    0.0
2000-01-01 00:00:10    NaN
2000-01-01 00:00:20    NaN
2000-01-01 00:00:30    NaN
2000-01-01 00:00:40    NaN
2000-01-01 00:00:50    NaN
2000-01-01 00:01:00    1.0
2000-01-01 00:01:10    NaN
2000-01-01 00:01:20    NaN
2000-01-01 00:01:30    NaN
2000-01-01 00:01:40    NaN
2000-01-01 00:01:50    NaN
2000-01-01 00:02:00    2.0
2000-01-01 00:02:10    NaN
2000-01-01 00:02:20    NaN
2000-01-01 00:02:30    NaN
2000-01-01 00:02:40    NaN
2000-01-01 00:02:50    NaN
2000-01-01 00:03:00    3.0
2000-01-01 00:04:10    4.0

Desired output:

2000-01-01 00:00:00    0.0
2000-01-01 00:00:10    0.0
2000-01-01 00:00:20    0.0
2000-01-01 00:00:30    0.0
2000-01-01 00:00:40    1.0
2000-01-01 00:00:50    1.0
2000-01-01 00:01:00    1.0
2000-01-01 00:01:10    1.0
2000-01-01 00:01:20    1.0
2000-01-01 00:01:30    1.0
2000-01-01 00:01:40    2.0
2000-01-01 00:01:50    2.0
2000-01-01 00:02:00    2.0
2000-01-01 00:02:10    2.0
2000-01-01 00:02:20    2.0
2000-01-01 00:02:30    2.0
2000-01-01 00:02:40    3.0
2000-01-01 00:02:50    3.0
2000-01-01 00:03:00    3.0
2000-01-01 00:04:10    4.0

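I managed to get the desired result by finding the edges of the gaps after upsampling, forward filling every gap, and then overwriting the right half of each gap with the value at its right edge. However, my data is large: some of my files have 1M gaps to fill, so this takes forever to run. I basically do it with a for loop over all the identified gaps.

Is there a way to do this faster?

Thanks!

Edit: I only want to upsample and fill gaps whose time difference is less than or equal to a given value. In this example that means gaps of at most 1 minute, so no upsampling or filling happens between the last two rows.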

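2 answers:

Answer 0 (score: 3)

If your data points are exactly 1 minute apart, you can do the following. On a 10-second grid each gap has 5 missing rows, so the first 3 are forward filled and the remaining 2 are backward filled: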

df.set_index(0).asfreq('10S').ffill(limit=3).bfill(limit=2)

Output:

                       1
0                       
2000-01-01 00:00:00  0.0
2000-01-01 00:00:10  0.0
2000-01-01 00:00:20  0.0
2000-01-01 00:00:30  0.0
2000-01-01 00:00:40  1.0
2000-01-01 00:00:50  1.0
2000-01-01 00:01:00  1.0
2000-01-01 00:01:10  1.0
2000-01-01 00:01:20  1.0
2000-01-01 00:01:30  1.0
2000-01-01 00:01:40  2.0
2000-01-01 00:01:50  2.0
2000-01-01 00:02:00  2.0
2000-01-01 00:02:10  2.0
2000-01-01 00:02:20  2.0
2000-01-01 00:02:30  2.0
2000-01-01 00:02:40  3.0
2000-01-01 00:02:50  3.0
2000-01-01 00:03:00  3.0
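
For reference, here is a self-contained sketch of the same idea that can be run as-is. It assumes a Series rather than the answer's `df`, evenly spaced 1-minute data, and a 10-second target grid. The names `gap_rows`, `n_forward` and `n_backward` are illustrative and not part of the original answer.

```python
import pandas as pd

# Evenly spaced sample data: one value per minute.
s = pd.Series(
    [0, 1, 2, 3],
    index=pd.date_range("2000-01-01", periods=4, freq="min"),
)

# A 1-minute gap on a 10-second grid contains 5 missing rows.
gap_rows = 5
n_forward = (gap_rows + 1) // 2   # left half (gets the extra row when odd)
n_backward = gap_rows // 2        # right half

filled = (
    s.asfreq("10s")               # insert NaN rows on the 10-second grid
     .ffill(limit=n_forward)      # fill the left part of each gap
     .bfill(limit=n_backward)     # fill what is left from the right
)
print(filled)
```

Because the limits are fixed, this only splits gaps correctly when every gap has the same length. A longer gap would keep NaN rows in its middle.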

Answer 1 (score: 2)

Setup

import pandas as pd

ts = pd.Series([0, 1, 2, 3], pd.date_range('2000-01-01', periods=4, freq='min'))

merge_asof with direction='nearest'

pd.merge_asof(
    ts.asfreq('10s').to_frame('left'),
    ts.to_frame('right'),
    left_index=True,
    right_index=True,
    direction='nearest'
)

                     left  right
2000-01-01 00:00:00   0.0      0
2000-01-01 00:00:10   NaN      0
2000-01-01 00:00:20   NaN      0
2000-01-01 00:00:30   NaN      0
2000-01-01 00:00:40   NaN      1
2000-01-01 00:00:50   NaN      1
2000-01-01 00:01:00   1.0      1
2000-01-01 00:01:10   NaN      1
2000-01-01 00:01:20   NaN      1
2000-01-01 00:01:30   NaN      1
2000-01-01 00:01:40   NaN      2
2000-01-01 00:01:50   NaN      2
2000-01-01 00:02:00   2.0      2
2000-01-01 00:02:10   NaN      2
2000-01-01 00:02:20   NaN      2
2000-01-01 00:02:30   NaN      2
2000-01-01 00:02:40   NaN      3
2000-01-01 00:02:50   NaN      3
2000-01-01 00:03:00   3.0      3

reindex with method='nearest'

ts.reindex(ts.asfreq('10s').index, method='nearest')

2000-01-01 00:00:00    0
2000-01-01 00:00:10    0
2000-01-01 00:00:20    0
2000-01-01 00:00:30    1
2000-01-01 00:00:40    1
2000-01-01 00:00:50    1
2000-01-01 00:01:00    1
2000-01-01 00:01:10    1
2000-01-01 00:01:20    1
2000-01-01 00:01:30    2
2000-01-01 00:01:40    2
2000-01-01 00:01:50    2
2000-01-01 00:02:00    2
2000-01-01 00:02:10    2
2000-01-01 00:02:20    2
2000-01-01 00:02:30    3
2000-01-01 00:02:40    3
2000-01-01 00:02:50    3
2000-01-01 00:03:00    3
Freq: 10S, dtype: int64

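Note: the two solutions break ties differently when a timestamp is exactly halfway between two neighbours, so their results differ slightly. In the comparison below, merge_asof matches the desired output from the question and reindex does not.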

pd.merge_asof(
    ts.asfreq('10s').to_frame('left'),
    ts.to_frame('merge_asof'),
    left_index=True,
    right_index=True,
    direction='nearest'
).assign(reindex=ts.reindex(ts.asfreq('10s').index, method='nearest'))

                     left  merge_asof  reindex
2000-01-01 00:00:00   0.0           0        0
2000-01-01 00:00:10   NaN           0        0
2000-01-01 00:00:20   NaN           0        0
2000-01-01 00:00:30   NaN           0        1  # This row is different
2000-01-01 00:00:40   NaN           1        1
2000-01-01 00:00:50   NaN           1        1
2000-01-01 00:01:00   1.0           1        1
2000-01-01 00:01:10   NaN           1        1
2000-01-01 00:01:20   NaN           1        1
2000-01-01 00:01:30   NaN           1        2  # This row is different
2000-01-01 00:01:40   NaN           2        2
2000-01-01 00:01:50   NaN           2        2
2000-01-01 00:02:00   2.0           2        2
2000-01-01 00:02:10   NaN           2        2
2000-01-01 00:02:20   NaN           2        2
2000-01-01 00:02:30   NaN           2        3  # This row is different
2000-01-01 00:02:40   NaN           3        3 
2000-01-01 00:02:50   NaN           3        3
2000-01-01 00:03:00   3.0           3        3
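
Neither answer covers the edit in the question, where gaps longer than a threshold should be left alone. The following is one possible vectorized sketch, not either answerer's method. It assumes every original timestamp already lies on the 10-second grid, and the names `max_gap`, `prev_t`, `next_t` and `use_prev` are illustrative. Ties are resolved toward the left neighbour explicitly, which matches the desired output.

```python
import pandas as pd

# Question's input, including the 70-second gap before the last row.
ts = pd.Series(
    [0, 1, 2, 3, 4],
    index=pd.to_datetime([
        "2000-01-01 00:00:00", "2000-01-01 00:01:00", "2000-01-01 00:02:00",
        "2000-01-01 00:03:00", "2000-01-01 00:04:10",
    ]),
)
max_gap = pd.Timedelta("1min")  # hypothetical threshold from the edit

grid = ts.asfreq("10s").index   # full regular grid, start to end

# For every grid point: timestamp and value of the neighbours on each side.
stamps = pd.Series(ts.index, index=ts.index)
prev_t = stamps.reindex(grid, method="ffill")
next_t = stamps.reindex(grid, method="bfill")
prev_v = ts.reindex(grid, method="ffill")
next_v = ts.reindex(grid, method="bfill")

# Take the left value when it is at least as close as the right one.
t = grid.to_series()
use_prev = (t - prev_t) <= (next_t - t)
filled = prev_v.where(use_prev, next_v)

# Drop grid rows that fall inside gaps longer than the threshold.
result = filled[(next_t - prev_t) <= max_gap]
print(result)
```

There is no Python-level loop over gaps here, so the cost should grow with the number of grid rows rather than with the number of gaps. If the original timestamps are not all on the grid, the target index would need to be built differently.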