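I saw in the Python documentation that it is possible to resample and synchronize two time series. My problem is harder because the time series have no regular time spacing. I have read in three time series with non-deterministic intraday timestamps. However, to run most analyses on these time series (covariance, correlation, etc.), I need them to have the same length.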
In Matlab, given three time series ts1, ts2, ts3 with non-deterministic intraday timestamps, I can synchronize them by calling synchronize:
[ts1, ts2] = synchronize(ts1, ts2, 'union');
[ts1, ts3] = synchronize(ts1, ts3, 'union');
[ts2, ts3] = synchronize(ts2, ts3, 'union');
Note that the time series have already been read into pandas DataFrames, so I need to be able to synchronize (and resample?) DataFrames that already exist.
Answer 0 (score: 0)
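Data frames can also be synchronized with a merge (pd.merge_asof in the code further down). In particular, we may want to synchronize 2 data frames and keep 2 different data fields rather than 1. For example, suppose we have these 3 data frames with temperature and humidity values to synchronize: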
df1
company_id log_date temperature
0 4 2020-02-29 00:00:00 24.0
1 4 2020-02-29 00:03:00 24.0
2 4 2020-02-29 00:06:00 23.9
3 4 2020-02-29 00:09:00 23.8
4 4 2020-02-29 00:12:00 23.8
5 4 2020-02-29 00:15:00 23.7
6 4 2020-02-29 00:18:00 23.6
7 4 2020-02-29 00:21:00 23.5
8 4 2020-02-29 00:24:00 23.4
9 4 2020-02-29 00:27:00 23.3
10 4 2020-02-29 00:30:00 24.0
11 4 2020-02-29 00:33:00 21.0
12 4 2020-02-29 00:36:00 22.9
13 4 2020-02-29 00:39:00 23.8
14 4 2020-02-29 00:42:00 22.8
15 4 2020-02-29 00:45:00 21.7
16 4 2020-02-29 00:48:00 22.6
17 4 2020-02-29 00:51:00 21.5
df2
company_id log_date humidity
0 4 2020-02-29 00:00:00 74.92
1 4 2020-02-29 00:05:00 75.00
2 4 2020-02-29 00:10:00 73.10
3 4 2020-02-29 00:15:00 72.10
4 4 2020-02-29 00:20:00 72.00
5 4 2020-02-29 00:25:00 73.00
6 4 2020-02-29 00:30:00 74.00
7 4 2020-02-29 00:35:00 72.10
8 4 2020-02-29 00:45:00 69.00
9 4 2020-02-29 00:50:00 71.92
df3
company_id log_date temperature
0 4 2020-02-29 00:00:00 20.00
1 4 2020-02-29 00:05:00 21.00
2 4 2020-02-29 00:10:00 22.00
3 4 2020-02-29 00:15:00 23.00
4 4 2020-02-29 00:20:00 23.10
5 4 2020-02-29 00:25:00 22.00
6 4 2020-02-29 00:30:00 22.00
7 4 2020-02-29 00:35:00 22.10
8 4 2020-02-29 00:45:00 23.00
9 4 2020-02-29 00:50:00 21.92
We can do something like this:
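import pandas as pd
# merge_asof needs datetime keys that are sorted in ascending order
df1['log_date'] = pd.to_datetime(df1['log_date'])
df2['log_date'] = pd.to_datetime(df2['log_date'])
df3['log_date'] = pd.to_datetime(df3['log_date'])
# for each df1 row, take the latest right-hand row at or before it, at most 5 minutes old
df_a = pd.merge_asof(df1, df2, on="log_date", by="company_id", tolerance=pd.Timedelta("5m"))
df_b = pd.merge_asof(df1, df3, on="log_date", by="company_id", tolerance=pd.Timedelta("5m"))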
and the resulting data frames are:
df_a
company_id log_date temperature humidity
0 4 2020-02-29 00:00:00 24.0 74.92
1 4 2020-02-29 00:03:00 24.0 74.92
2 4 2020-02-29 00:06:00 23.9 75.00
3 4 2020-02-29 00:09:00 23.8 75.00
4 4 2020-02-29 00:12:00 23.8 73.10
5 4 2020-02-29 00:15:00 23.7 72.10
6 4 2020-02-29 00:18:00 23.6 72.10
7 4 2020-02-29 00:21:00 23.5 72.00
8 4 2020-02-29 00:24:00 23.4 72.00
9 4 2020-02-29 00:27:00 23.3 73.00
10 4 2020-02-29 00:30:00 24.0 74.00
11 4 2020-02-29 00:33:00 21.0 74.00
12 4 2020-02-29 00:36:00 22.9 72.10
13 4 2020-02-29 00:39:00 23.8 72.10
14 4 2020-02-29 00:42:00 22.8 NaN
15 4 2020-02-29 00:45:00 21.7 69.00
16 4 2020-02-29 00:48:00 22.6 69.00
17 4 2020-02-29 00:51:00 21.5 71.92
df_b
company_id log_date temperature_x temperature_y
0 4 2020-02-29 00:00:00 24.0 20.00
1 4 2020-02-29 00:03:00 24.0 20.00
2 4 2020-02-29 00:06:00 23.9 21.00
3 4 2020-02-29 00:09:00 23.8 21.00
4 4 2020-02-29 00:12:00 23.8 22.00
5 4 2020-02-29 00:15:00 23.7 23.00
6 4 2020-02-29 00:18:00 23.6 23.00
7 4 2020-02-29 00:21:00 23.5 23.10
8 4 2020-02-29 00:24:00 23.4 23.10
9 4 2020-02-29 00:27:00 23.3 22.00
10 4 2020-02-29 00:30:00 24.0 22.00
11 4 2020-02-29 00:33:00 21.0 22.00
12 4 2020-02-29 00:36:00 22.9 22.10
13 4 2020-02-29 00:39:00 23.8 22.10
14 4 2020-02-29 00:42:00 22.8 NaN
15 4 2020-02-29 00:45:00 21.7 23.00
16 4 2020-02-29 00:48:00 22.6 23.00
17 4 2020-02-29 00:51:00 21.5 21.92
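In the first we have 2 different data fields, temperature & humidity; in the second we have 2 different versions of temperature (temperature_x from df1 and temperature_y from df3). Rows with no match within the 5-minute tolerance, such as 00:42, get NaN. This may be what you are trying to achieve.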
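If what you want is closer to the 'union' behaviour in the question, here is a minimal sketch of one way to get it: outer-join the series on their timestamps and fill the gaps. The series ts1 and ts2 and their values are made up for illustration, and linear interpolation in time is only one possible fill rule (a forward fill would be another), so treat this as an assumption rather than an exact equivalent of Matlab's synchronize.

```python
import pandas as pd

# Hypothetical irregular series; the values are invented for this example.
ts1 = pd.Series(
    [24.0, 23.9, 23.7],
    index=pd.to_datetime(
        ["2020-02-29 00:00", "2020-02-29 00:06", "2020-02-29 00:15"]
    ),
    name="ts1",
)
ts2 = pd.Series(
    [74.9, 73.1, 72.1],
    index=pd.to_datetime(
        ["2020-02-29 00:00", "2020-02-29 00:10", "2020-02-29 00:15"]
    ),
    name="ts2",
)

# Outer join on the index: the result is indexed by the union of both
# sets of timestamps, with NaN where a series has no observation.
synced = pd.concat([ts1, ts2], axis=1, join="outer").sort_index()

# Fill the gaps by interpolating linearly in time (an assumed choice).
synced = synced.interpolate(method="time")

# Both columns now share one index, so pairwise statistics are defined.
corr = synced["ts1"].corr(synced["ts2"])
cov = synced["ts1"].cov(synced["ts2"])
```

Here synced has one row per distinct timestamp in either series. If the series do not start or end at the same time, the leading gaps stay NaN after interpolation, so you may want to call dropna() before computing statistics.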