pandas DataFrame resample from irregular time series index

Time: 2018-03-09 10:46:36

Tags: python pandas datetime time-series

I want to resample a DataFrame to every five seconds, where the timestamps of the original data are irregular. Apologies if this looks like a duplicate question, but I have trouble getting the interpolation to line up with the timestamps of the data, which is why I include my DataFrame in this question. The graph in this answer shows the result I want, but I cannot use the package suggested there.

Consider the following climb path of an airplane (as dict on pastebin):

    Altitude        Time
1       0.00     0.00000
2    1000.00    16.45350
3    2000.00    33.19584
4    3000.00    50.25330
5    4000.00    67.64580
6    5000.00    85.38720
7    6000.00   103.56720
8    7000.00   122.29260
9    8000.00   141.61440
10   9000.00   161.59140
11   9999.67   182.27940
12  10000.30   182.33940
13  10000.30   199.76880
14  10000.30   199.82880
15  11000.00   221.67660
16  12000.00   244.36260
17  13000.00   267.93900
18  14000.00   292.46940
19  15000.00   318.01080
20  16000.00   344.36820
21  17000.00   371.32200
22  18000.00   398.91420
23  19000.00   427.19100
24  20000.00   456.24900
25  21000.00   486.38940
26  22000.00   517.91640
27  23000.00   550.96140
28  24000.00   585.65460
29  25000.00   622.12800
30  26000.00   660.35400
31  27000.00   700.37400
32  28000.00   742.39200
33  29000.00   786.57600
34  30000.00   833.13000
35  31000.00   882.09000
36  32000.00   933.46200
37  33000.00   987.40800
38  34000.00  1044.06000
39  35000.00  1103.85000
40  36000.00  1167.52200
41  36088.90  1173.39000
42  36089.60  1173.45000
43  36671.70  1216.60200
44  36672.40  1216.66200
45  38000.00  1295.80200
46  39000.00  1368.45000
47  40000.00  1458.00000
48  41000.00  1574.08200
49  42000.00  1730.97000
50  42231.00  1775.19600

Attempted solutions

First, I tried resampling while keeping the original index, as shown in this question, so that I could interpolate linearly, but I found no interpolation method that produces the correct result (note that the original Time column matches the index only at 16.45 s):

df = df.set_index(pd.to_datetime(df['Time'], unit='s'), drop=False)
resample_index = pd.date_range(start=df.index[0], end=df.index[-1], freq='5s')
dummy_frame = pd.DataFrame(np.nan, index=resample_index, columns=df.columns)
df.combine_first(dummy_frame).interpolate().iloc[:6]

                                 Time  Altitude
1970-01-01 00:00:00.000000   0.000000       0.0
1970-01-01 00:00:05.000000   4.113375     250.0
1970-01-01 00:00:10.000000   8.226750     500.0
1970-01-01 00:00:15.000000  12.340125     750.0
1970-01-01 00:00:16.453500  16.453500    1000.0
1970-01-01 00:00:20.000000  20.639085    1250.0

Second, I tried resampling without keeping the original index, first to 1 s and then to 5 s, as shown in this answer, but the interpolated values do not reach the end of the data, and the altitude values are wrong as well (1000 ft should fall between 15 and 20 seconds). Resampling to 1 s alone already gives the wrong result.

df.resample('1s').interpolate(method='linear').resample('5s').asfreq()

                       Time      Altitude
1970-01-01 00:00:00     0.0      0.000000
1970-01-01 00:00:05     5.0    137.174211
1970-01-01 00:00:10    10.0    274.348422
1970-01-01 00:00:15    15.0    411.522634
1970-01-01 00:00:20    20.0    548.696845
1970-01-01 00:00:25    25.0    685.871056
1970-01-01 00:00:30    30.0    823.045267
1970-01-01 00:00:35    35.0    960.219479
1970-01-01 00:00:40    40.0   1097.393690
1970-01-01 00:00:45    45.0   1234.567901
1970-01-01 00:00:50    50.0   1371.742112
1970-01-01 00:00:55    55.0   1508.916324
1970-01-01 00:01:00    60.0   1646.090535
1970-01-01 00:01:05    65.0   1783.264746
1970-01-01 00:01:10    70.0   1920.438957
1970-01-01 00:01:15    75.0   2057.613169
1970-01-01 00:01:20    80.0   2194.787380
1970-01-01 00:01:25    85.0   2331.961591
1970-01-01 00:01:30    90.0   2469.135802
1970-01-01 00:01:35    95.0   2606.310014
1970-01-01 00:01:40   100.0   2743.484225
1970-01-01 00:01:45   105.0   2880.658436
1970-01-01 00:01:50   110.0   3017.832647
1970-01-01 00:01:55   115.0   3155.006859
1970-01-01 00:02:00   120.0   3292.181070
1970-01-01 00:02:05   125.0   3429.355281
1970-01-01 00:02:10   130.0   3566.529492
1970-01-01 00:02:15   135.0   3703.703704
1970-01-01 00:02:20   140.0   3840.877915
1970-01-01 00:02:25   145.0   3978.052126
...                     ...           ...
1970-01-01 00:27:10  1458.0  40000.000000
1970-01-01 00:27:15  1458.0  40000.000000
1970-01-01 00:27:20  1458.0  40000.000000
1970-01-01 00:27:25  1458.0  40000.000000
1970-01-01 00:27:30  1458.0  40000.000000
1970-01-01 00:27:35  1458.0  40000.000000
1970-01-01 00:27:40  1458.0  40000.000000
1970-01-01 00:27:45  1458.0  40000.000000
1970-01-01 00:27:50  1458.0  40000.000000
1970-01-01 00:27:55  1458.0  40000.000000
1970-01-01 00:28:00  1458.0  40000.000000
1970-01-01 00:28:05  1458.0  40000.000000
1970-01-01 00:28:10  1458.0  40000.000000
1970-01-01 00:28:15  1458.0  40000.000000
1970-01-01 00:28:20  1458.0  40000.000000
1970-01-01 00:28:25  1458.0  40000.000000
1970-01-01 00:28:30  1458.0  40000.000000
1970-01-01 00:28:35  1458.0  40000.000000
1970-01-01 00:28:40  1458.0  40000.000000
1970-01-01 00:28:45  1458.0  40000.000000
1970-01-01 00:28:50  1458.0  40000.000000
1970-01-01 00:28:55  1458.0  40000.000000
1970-01-01 00:29:00  1458.0  40000.000000
1970-01-01 00:29:05  1458.0  40000.000000
1970-01-01 00:29:10  1458.0  40000.000000
1970-01-01 00:29:15  1458.0  40000.000000
1970-01-01 00:29:20  1458.0  40000.000000
1970-01-01 00:29:25  1458.0  40000.000000
1970-01-01 00:29:30  1458.0  40000.000000
1970-01-01 00:29:35  1458.0  40000.000000

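Question

How do I resample the original data to 5 s while performing a correct interpolation? Am I simply using the wrong interpolation method?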
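
One way to see what goes wrong in the second attempt is sketched below. It assumes that upsampling with `resample('1s')` is essentially a reindex onto the whole-second grid, as the pandas documentation describes for `asfreq()`, so that samples lying between whole seconds are dropped before any interpolation happens. The three-point series is made up for illustration and is not the climb data.

```python
import pandas as pd

# Made-up irregular samples: only 0 s and 20 s fall on whole seconds.
times = pd.to_datetime([0.0, 16.4535, 20.0], unit="s")
altitude = pd.Series([0.0, 1000.0, 1200.0], index=times)

# Upsampling to a 1 s grid reindexes onto that grid, so the off-grid
# sample at 16.4535 s is not carried over.
on_grid = altitude.resample("1s").asfreq()
kept = on_grid.dropna()

# Interpolating now draws a straight line between the surviving points.
filled = on_grid.interpolate(method="linear")

print(kept)
print(filled.loc[pd.Timestamp("1970-01-01 00:00:16")])
```

In the climb data only 0 s and 1458 s are whole seconds, which is consistent with the output above: a straight line from 0 ft to 40000 ft, followed by constant values after 1458 s.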

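2 answers:

Answer 0: (score: 2)

With the help of @Martin Schmelzer (thanks!) I got the first approach from the question to work by passing time as the method argument of the pandas interpolate method: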

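# Build the regular 5 s grid and an all-NaN frame on it, then merge it with the data
resample_index = pd.date_range(start=df.index[0], end=df.index[-1], freq='5s')
dummy_frame = pd.DataFrame(np.nan, index=resample_index, columns=df.columns)
# 'time' interpolates according to the time gaps in the DatetimeIndex
df.combine_first(dummy_frame).interpolate('time').iloc[:6]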

                               Altitude     Time
1970-01-01 00:00:00.000000     0.000000   0.0000
1970-01-01 00:00:05.000000   303.886711   5.0000
1970-01-01 00:00:10.000000   607.773422  10.0000
1970-01-01 00:00:15.000000   911.660133  15.0000
1970-01-01 00:00:16.453500  1000.000000  16.4535
1970-01-01 00:00:20.000000  1211.828215  20.0000

I can then resample this to 5 s, or to any other interval, and the result is accurate.

df.combine_first(dummy_frame).interpolate('time').resample('5s').asfreq().head()
                        Altitude  Time
1970-01-01 00:00:00     0.000000   0.0
1970-01-01 00:00:05   303.886711   5.0
1970-01-01 00:00:10   607.773422  10.0
1970-01-01 00:00:15   911.660133  15.0
1970-01-01 00:00:20  1211.828215  20.0

So in the end it turned out that I was simply using the wrong interpolation method.
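
The difference between the two methods can be reproduced on a tiny example. This is a minimal sketch that uses only the first two samples from the question (0 s at 0 ft and 16.4535 s at 1000 ft) plus empty rows at the 5 s grid points. The variable names are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Two known samples from the question plus empty rows at the 5 s
# grid points that should be filled in.
index = pd.to_datetime([0.0, 5.0, 10.0, 15.0, 16.4535], unit="s")
altitude = pd.Series([0.0, np.nan, np.nan, np.nan, 1000.0], index=index)

# 'linear' treats the rows as equally spaced and ignores the index.
by_position = altitude.interpolate(method="linear")

# 'time' weights each row by the actual time elapsed between samples.
by_time = altitude.interpolate(method="time")

print(pd.DataFrame({"linear": by_position, "time": by_time}))
```

With these numbers the 5 s row is 250 ft under 'linear' and about 303.89 ft under 'time', which matches the two outputs shown in this thread.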

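Answer 1: (score: 0)

I found this problem surprisingly difficult, especially when the set of values to interpolate at cannot easily be defined with date_range(). There are many pitfalls:

  1. Duplicates in the original dataset propagate into duplicates in the interpolated data frame. This is undesirable and leads to interpolated arrays of different lengths.
  2. If a value to interpolate at already exists in the data frame, a duplicate is added.
  3. You have to make sure to combine the data frames and then sort them appropriately.

This code worked for me: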


import pandas as pd
import numpy as np

def interpolate_into(df, interpolate_keys, index_name, columns):
    # Downselect to only those columns necessary
    # Also, remove duplicated values in the data frame. Eye roll.
    df = df[[index_name] + columns]
    df = df.drop_duplicates(subset=[index_name], keep="first")
    df = df.set_index(index_name)

    # Only interpolate into values that don't already exist. This is not handled manually.
    needed_interpolate_keys = [i for i in interpolate_keys if i not in df.index]

    # Create a dummy DF that has the x or time values we want to interpolate into.
    dummy_frame = pd.DataFrame(np.nan, index=needed_interpolate_keys, columns=df.columns)
    dummy_frame[index_name] = pd.to_datetime(needed_interpolate_keys)
    dummy_frame = dummy_frame.set_index(index_name)

    # Combine the dataframes, sort, interpolate, downselect.
    df = dummy_frame.combine_first(df)
    df = df.sort_values(by=index_name, ascending=True)
    # Note: the default method ('linear') ignores the spacing of the index;
    # the other answer passes 'time' to weight by the timestamps instead.
    df = df.interpolate()
    df = df[df.index.isin(interpolate_keys)]
    return df

df is the original data frame.

interpolate_keys is the list of "x" values at which new values are to be interpolated.

index_name is the name of the column that holds those keys.

columns are the other columns to interpolate.
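
The same steps (drop duplicates, merge the original and target timestamps, sort, interpolate, select the targets) can also be sketched more compactly with an index union. This is an illustrative variant rather than the code from the answer: `interpolate_at` is a hypothetical helper name, the sample data is a small made-up series based on the first rows of the question, and it uses method='time' as in the other answer.

```python
import pandas as pd


def interpolate_at(series, targets):
    """Time-weighted interpolation of `series` at the timestamps in `targets`.

    Hypothetical helper for illustration, not part of pandas.
    """
    targets = pd.DatetimeIndex(targets)
    # Keep the first of any repeated timestamps so reindexing is well defined.
    unique = series[~series.index.duplicated(keep="first")].sort_index()
    # The union holds every original and every target timestamp exactly once.
    combined = unique.reindex(unique.index.union(targets))
    # Interpolate on the merged index, then keep only the requested rows.
    return combined.interpolate(method="time").reindex(targets)


# Small example with one duplicated timestamp.
raw = pd.Series(
    [0.0, 1000.0, 1000.0, 2000.0],
    index=pd.to_datetime([0.0, 16.4535, 16.4535, 33.19584], unit="s"),
)
grid = pd.date_range(raw.index[0], raw.index[-1], freq="5s")
result = interpolate_at(raw, grid)
print(result)
```

Because the union contains each timestamp only once, a target that already exists in the data does not produce a duplicate row, and the final reindex returns exactly one value per requested timestamp.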