Pandas: take the nearest value to each second and interpolate

Time: 2019-09-09 07:43:27

Tags: python pandas dataframe interpolation resampling

I would like to convert a DataFrame in the following format, as an example:

>>> df
                      vals
2019-08-10 12:03:05   1.0
2019-08-10 12:03:06   NaN
2019-08-10 12:03:07   NaN
2019-08-10 12:03:08   3.0
2019-08-10 12:03:09   4.0
2019-08-10 12:03:10   NaN
2019-08-10 12:03:11   NaN
2019-08-10 12:03:12   5.0
2019-08-10 12:03:13   NaN
2019-08-10 12:03:14   1.0
2019-08-10 12:03:15   NaN
2019-08-10 12:03:16   NaN
2019-08-10 12:03:17   6.0

into the following:

>>> df
                      vals
2019-08-10 12:03:05   1.0
2019-08-10 12:03:06   1.667
2019-08-10 12:03:07   2.333
2019-08-10 12:03:08   3.0
2019-08-10 12:03:09   3.667
2019-08-10 12:03:10   4.333
2019-08-10 12:03:11   5.0
2019-08-10 12:03:12   3.667
2019-08-10 12:03:13   2.333
2019-08-10 12:03:14   1.0
2019-08-10 12:03:15   2.667
2019-08-10 12:03:16   4.333
2019-08-10 12:03:17   6.0

by first aligning the DataFrame to something like the following (taking the value nearest to every third second):

>>> df
                      vals
2019-08-10 12:03:05   1.0
2019-08-10 12:03:06   NaN
2019-08-10 12:03:07   NaN
2019-08-10 12:03:08   3.0
2019-08-10 12:03:09   NaN
2019-08-10 12:03:10   NaN
2019-08-10 12:03:11   5.0
2019-08-10 12:03:12   NaN
2019-08-10 12:03:13   NaN
2019-08-10 12:03:14   1.0
2019-08-10 12:03:15   NaN
2019-08-10 12:03:16   NaN
2019-08-10 12:03:17   6.0

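and then interpolating linearly between those values to produce the final DataFrame. If a gap is longer than 2 seconds, I only want to interpolate between those 2 values.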
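For anyone who wants to reproduce this, the input frame shown at the top can be rebuilt roughly as follows. This is a minimal sketch that assumes the timestamps form the index (a regular one-second DatetimeIndex) and that the single column is named vals, as in the printed output.

```python
import numpy as np
import pandas as pd

# One row per second from 12:03:05 to 12:03:17, matching the printed frame.
index = pd.date_range("2019-08-10 12:03:05", periods=13, freq="s")

nan = np.nan
df = pd.DataFrame(
    {"vals": [1.0, nan, nan, 3.0, 4.0, nan, nan, 5.0, nan, 1.0, nan, nan, 6.0]},
    index=index,
)
print(df)
```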

This is what I have tried so far:

df.resample('3s').nearest()

which produces:

>>> df.resample('3s').nearest()
                     vals
2019-08-10 12:03:03   1.0
2019-08-10 12:03:06   NaN
2019-08-10 12:03:09   4.0
2019-08-10 12:03:12   5.0
2019-08-10 12:03:15   NaN

Also:

>>> df.resample('2s').nearest()
                     vals
2019-08-10 12:03:04   1.0
2019-08-10 12:03:06   NaN
2019-08-10 12:03:08   3.0
2019-08-10 12:03:10   NaN
2019-08-10 12:03:12   5.0
2019-08-10 12:03:14   1.0
2019-08-10 12:03:16   NaN

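This makes it clear that nearest is a complete lie, or at least a misnomer, because the value nearest to 12:03:10 is clearly 4.0. Also, the final value at 2019-08-10 12:03:16 should surely be 6.0.

This is only the step of aligning the values to the seconds; after that, interpolate looks like it would work.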
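A possible reading of the output above, offered as a sketch rather than a definitive explanation: the index already contains a row for every second, so nearest picks the nearest row, and that row may itself hold NaN. Under that assumption, dropping the NaN rows before looking up the nearest timestamp gives the alignment described earlier. The names observed, grid, aligned and result are hypothetical, and anchoring the 3-second grid at the first timestamp is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2019-08-10 12:03:05", periods=13, freq="s")
nan = np.nan
df = pd.DataFrame(
    {"vals": [1.0, nan, nan, 3.0, 4.0, nan, nan, 5.0, nan, 1.0, nan, nan, 6.0]},
    index=index,
)

# Keep only rows with a real observation, so "nearest" cannot land on a NaN row.
observed = df["vals"].dropna()

# Assumed target grid: every third second, starting at the first timestamp.
grid = pd.date_range(df.index[0], df.index[-1], freq="3s")

# For each grid point, take the observation with the nearest timestamp.
aligned = observed.reindex(grid, method="nearest")

# Put the aligned points back on the 1-second index and fill the gaps linearly.
result = aligned.reindex(df.index).interpolate(method="linear")
print(result.round(3))
```

Series.reindex also accepts a tolerance argument, which could be used to reject observations that are too far from a grid point if that is what the 2-second rule is meant to express.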

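Thanks for your help.

3 answers:

Answer 0: (score: 1)

I think you need the base parameter, set to the first value of the index modulo 3 (because of the 3-second period), to shift the offset of the sampling bins, together with Resampler.first: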

df['new'] = df.resample('3s', base=df.index[0].second % 3).first()
print (df)
                     vals  new
2019-08-10 12:03:05   1.0  1.0
2019-08-10 12:03:06   NaN  NaN
2019-08-10 12:03:07   NaN  NaN
2019-08-10 12:03:08   3.0  3.0
2019-08-10 12:03:09   4.0  NaN
2019-08-10 12:03:10   NaN  NaN
2019-08-10 12:03:11   NaN  5.0
2019-08-10 12:03:12   5.0  NaN
2019-08-10 12:03:13   NaN  NaN
2019-08-10 12:03:14   1.0  1.0
2019-08-10 12:03:15   NaN  NaN
2019-08-10 12:03:16   NaN  NaN
2019-08-10 12:03:17   6.0  6.0

Then interpolate:

df['new'] = df['new'].interpolate()
print (df)
                     vals       new
2019-08-10 12:03:05   1.0  1.000000
2019-08-10 12:03:06   NaN  1.666667
2019-08-10 12:03:07   NaN  2.333333
2019-08-10 12:03:08   3.0  3.000000
2019-08-10 12:03:09   4.0  3.666667
2019-08-10 12:03:10   NaN  4.333333
2019-08-10 12:03:11   NaN  5.000000
2019-08-10 12:03:12   5.0  3.666667
2019-08-10 12:03:13   NaN  2.333333
2019-08-10 12:03:14   1.0  1.000000
2019-08-10 12:03:15   NaN  2.666667
2019-08-10 12:03:16   NaN  4.333333
2019-08-10 12:03:17   6.0  6.000000

Test with the index shifted by 2 seconds:

df.index += pd.Timedelta(2, 's')
df['new'] = df.resample('3s', base=df.index[0].second % 3).first()
print (df)

                     vals  new
2019-08-10 12:03:07   1.0  1.0
2019-08-10 12:03:08   NaN  NaN
2019-08-10 12:03:09   NaN  NaN
2019-08-10 12:03:10   3.0  3.0
2019-08-10 12:03:11   4.0  NaN
2019-08-10 12:03:12   NaN  NaN
2019-08-10 12:03:13   NaN  5.0
2019-08-10 12:03:14   5.0  NaN
2019-08-10 12:03:15   NaN  NaN
2019-08-10 12:03:16   1.0  1.0
2019-08-10 12:03:17   NaN  NaN
2019-08-10 12:03:18   NaN  NaN
2019-08-10 12:03:19   6.0  6.0
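The base argument was deprecated in pandas 1.1 and removed in pandas 2.0. On those versions, a possible equivalent for this data is origin='start', which anchors the bins at the first timestamp. The following is a minimal sketch assuming pandas 1.1 or later; the variable binned is a hypothetical name.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2019-08-10 12:03:05", periods=13, freq="s")
nan = np.nan
df = pd.DataFrame(
    {"vals": [1.0, nan, nan, 3.0, 4.0, nan, nan, 5.0, nan, 1.0, nan, nan, 6.0]},
    index=index,
)

# origin="start" makes the first 3-second bin begin at the first timestamp,
# which plays the role of base=df.index[0].second % 3 for this data.
binned = df["vals"].resample("3s", origin="start").first()

# Assignment aligns on the index, so seconds between bin edges become NaN.
df["new"] = binned

# Fill the gaps between the aligned points linearly.
df["new"] = df["new"].interpolate()
print(df)
```

Because the bins follow the first timestamp, no separate modulo computation is needed when the index is shifted.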

Answer 1: (score: 1)

# Assumes the timestamps are in a column named Time rather than in the index.
df1 = df.set_index(['Time']).interpolate(method='linear').reset_index()
print(df1)

Output:

                   Time     vals
0   2019-08-10 12:03:05     1.000000
1   2019-08-10 12:03:06     1.666667
2   2019-08-10 12:03:07     2.333333
3   2019-08-10 12:03:08     3.000000
4   2019-08-10 12:03:09     4.000000
5   2019-08-10 12:03:10     4.333333
6   2019-08-10 12:03:11     4.666667
7   2019-08-10 12:03:12     5.000000
8   2019-08-10 12:03:13     3.000000
9   2019-08-10 12:03:14     1.000000
10  2019-08-10 12:03:15     2.666667
11  2019-08-10 12:03:16     4.333333
12  2019-08-10 12:03:17     6.000000

Answer 2: (score: 0)

If you want to replace NaN values with the nearest value, you can use interpolation:

data['value'] = data['value'].interpolate(method='nearest')