Merging multiple dataframes with the same start and end times but different lengths along a time-series index

Asked: 2018-12-18 11:13:03

Tags: python pandas dataframe join time-series

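I have been beating myself up over this. I think I have it right, but I am not fully convinced, so I would like to share my solution and have others tell me whether it is correct.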

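I want to join multiple dataframes along a time-series index. They all share the same start and end times, but each one has a different length. I then want to make sure that any gaps in the time series are reindexed so the missing timestamps appear, and that the missing values are forward-filled from the data in the original dataframe.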

DataFrame1:

Time    O   H   L   C   Symbol
00:00:00    2   3   1   1   XXX/XXX
01:00:00    1   4   1   1   XXX/XXX
02:00:00    1   4   1   1   XXX/XXX
03:00:00    1   4   1   1   XXX/XXX
04:00:00    2   3   1   1   XXX/XXX
05:00:00    1   3   1   1   XXX/XXX
06:00:00    1   3   1   1   XXX/XXX
07:00:00    2   4   1   1   XXX/XXX
08:00:00    2   3   1   1   XXX/XXX
09:00:00    1   4   1   1   XXX/XXX
10:00:00    1   3   1   1   XXX/XXX
11:00:00    2   4   1   1   XXX/XXX
12:00:00    1   4   1   1   XXX/XXX
13:00:00    2   3   1   1   XXX/XXX
14:00:00    2   4   1   1   XXX/XXX

Len = 15

DataFrame2:

Time    O   H   L   C   Symbol
00:00:00    2   3   1   1   XXX/YYY
01:00:00    1   4   1   1   XXX/YYY
02:00:00    1   4   1   1   XXX/YYY
03:00:00    1   4   1   1   XXX/YYY
04:00:00    2   3   1   1   XXX/YYY
06:00:00    1   3   1   1   XXX/YYY
07:00:00    1   3   1   1   XXX/YYY
08:00:00    2   4   1   1   XXX/YYY
09:00:00    2   3   1   1   XXX/YYY
10:00:00    1   4   1   1   XXX/YYY
12:00:00    1   3   1   1   XXX/YYY
13:00:00    2   4   1   1   XXX/YYY
14:00:00    1   4   1   1   XXX/YYY

Len = 13

DataFrame3:

Time    O   H   L   C   Symbol
00:00:00    2   3   1   1   XXX/ZZZ
02:00:00    1   4   1   1   XXX/ZZZ
03:00:00    1   4   1   1   XXX/ZZZ
04:00:00    1   4   1   1   XXX/ZZZ
05:00:00    2   3   1   1   XXX/ZZZ
06:00:00    1   3   1   1   XXX/ZZZ
07:00:00    1   3   1   1   XXX/ZZZ
08:00:00    2   4   1   1   XXX/ZZZ
10:00:00    1   4   1   1   XXX/ZZZ
11:00:00    1   3   1   1   XXX/ZZZ
12:00:00    2   4   1   1   XXX/ZZZ
14:00:00    1   4   1   1   XXX/ZZZ

Len = 12

The final result should be an aligned dataframe that shows all the data, before forward-filling:

Time    O   H   L   C   Symbol      Time    O   H   L   C   Symbol      Time    O   H   L   C   Symbol
00:00:00    2   3   1   1   XXX/XXX     00:00:00    2   3   1   1   XXX/YYY     00:00:00    2   3   1   1   XXX/ZZZ
01:00:00    1   4   1   1   XXX/XXX     01:00:00    1   4   1   1   XXX/YYY     01:00:00    nan nan nan nan nan
02:00:00    1   4   1   1   XXX/XXX     02:00:00    1   4   1   1   XXX/YYY     02:00:00    1   4   1   1   XXX/ZZZ
03:00:00    1   4   1   1   XXX/XXX     03:00:00    1   4   1   1   XXX/YYY     03:00:00    1   4   1   1   XXX/ZZZ
04:00:00    2   3   1   1   XXX/XXX     04:00:00    2   3   1   1   XXX/YYY     04:00:00    1   4   1   1   XXX/ZZZ
05:00:00    1   3   1   1   XXX/XXX     05:00:00    nan nan nan nan nan     05:00:00    2   3   1   1   XXX/ZZZ
06:00:00    1   3   1   1   XXX/XXX     06:00:00    1   3   1   1   XXX/YYY     06:00:00    1   3   1   1   XXX/ZZZ
07:00:00    2   4   1   1   XXX/XXX     07:00:00    1   3   1   1   XXX/YYY     07:00:00    1   3   1   1   XXX/ZZZ
08:00:00    2   3   1   1   XXX/XXX     08:00:00    2   4   1   1   XXX/YYY     08:00:00    2   4   1   1   XXX/ZZZ
09:00:00    1   4   1   1   XXX/XXX     09:00:00    2   3   1   1   XXX/YYY     09:00:00    nan nan nan nan nan
10:00:00    1   3   1   1   XXX/XXX     10:00:00    1   4   1   1   XXX/YYY     10:00:00    1   4   1   1   XXX/ZZZ
11:00:00    2   4   1   1   XXX/XXX     11:00:00    nan nan nan nan nan     11:00:00    1   3   1   1   XXX/ZZZ
12:00:00    1   4   1   1   XXX/XXX     12:00:00    1   3   1   1   XXX/YYY     12:00:00    2   4   1   1   XXX/ZZZ
13:00:00    2   3   1   1   XXX/XXX     13:00:00    2   4   1   1   XXX/YYY     13:00:00    nan nan nan nan nan
14:00:00    2   4   1   1   XXX/XXX     14:00:00    1   4   1   1   XXX/YYY     14:00:00    1   4   1   1   XXX/ZZZ
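For reference, a minimal sketch of one way to produce a side-by-side table like the one above is `pd.concat` with `axis=1`, which aligns the frames on the union of their indexes and leaves NaN where a frame has no row. The helper `make_frame`, the date, and the toy values are assumptions for illustration only, and the frames are shortened to five hours.

```python
import pandas as pd


def make_frame(hours, symbol):
    """Build a toy hourly OHLC frame (hypothetical helper, made-up values)."""
    index = pd.to_datetime([f"2010-01-04 {h:02d}:00:00" for h in hours])
    index.name = "Time"
    return pd.DataFrame(
        {"O": 1.0, "H": 4.0, "L": 1.0, "C": 1.0, "Symbol": symbol},
        index=index,
    )


df1 = make_frame(range(0, 5), "XXX/XXX")   # 00:00 to 04:00, no gaps
df2 = make_frame([0, 1, 2, 4], "XXX/YYY")  # 03:00 is missing
df3 = make_frame([0, 2, 3, 4], "XXX/ZZZ")  # 01:00 is missing

# axis=1 places the frames side by side; rows are matched on the Time index.
# keys= adds an outer column level so the repeated O/H/L/C names do not clash.
aligned = pd.concat(
    [df1, df2, df3],
    axis=1,
    keys=["XXX/XXX", "XXX/YYY", "XXX/ZZZ"],
).sort_index()

print(aligned)
```

The missing hours show up as NaN in the block belonging to the frame that lacks them, which matches the layout of the expected result before any forward-filling.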

The approach I took is to join each dataframe along the time index:

table = (
    DataTableEurUsd.reset_index("Time")
    .join(DataTableAudUsd.reset_index("Time"), lsuffix="_y", rsuffix="_x")
    .join(DataTableEurChf.reset_index("Time"), lsuffix="_y", rsuffix="_x")
)

where:

DataTableEurUsd =
        Open    High    Low Close   RealVolume  Spread  TickVolume  Symbol
    Time                                
    2010.01.04 00:00:00 1.43259 1.43336 1.43151 1.43153 0.0 12.0    969.0   EURUSD
    2010.01.04 01:00:00 1.43151 1.43153 1.42879 1.42886 0.0 15.0    2098.0  EURUSD
    2010.01.04 02:00:00 1.42885 1.42885 1.42569 1.42705 0.0 15.0    2082.0  EURUSD
    2010.01.04 03:00:00 1.42702 1.42989 1.42700 1.42939 0.0 14.0    1544.0  EURUSD
    2010.01.04 05:00:00 1.42938 1.42968 1.42718 1.42848 0.0 15.0    1131.0  EURUSD

DataTableAudUsd =
        Open    High    Low Close   RealVolume  Spread  TickVolume  Symbol
    Time                                
    2010.01.04 00:00:00 0.89938 0.89953 0.89709 0.89711 0.0 30.0    1144.0  AUDUSD
    2010.01.04 01:00:00 0.89712 0.89795 0.89612 0.89632 0.0 35.0    1735.0  AUDUSD
    2010.01.04 02:00:00 0.89634 0.89645 0.89372 0.89500 0.0 30.0    1771.0  AUDUSD
    2010.01.04 04:00:00 0.89502 0.89653 0.89502 0.89613 0.0 35.0    1242.0  AUDUSD
    2010.01.04 05:00:00 0.89611 0.89648 0.89479 0.89633 0.0 30.0    663.0   AUDUSD

DataTableEurChf =
        Open    High    Low Close   RealVolume  Spread  TickVolume  Symbol
    Time
    2010.01.04 00:00:00 1.48238 1.48354 1.48227 1.48334 0.0 36.0    1232.0  EURCHF
    2010.01.04 02:00:00 1.48327 1.48470 1.48087 1.48250 0.0 34.0    2186.0  EURCHF
    2010.01.04 03:00:00 1.48251 1.48311 1.48150 1.48294 0.0 34.0    1939.0  EURCHF
    2010.01.04 04:00:00 1.48292 1.48317 1.48114 1.48239 0.0 34.0    1510.0  EURCHF
    2010.01.04 05:00:00 1.48235 1.48245 1.48150 1.48181 0.0 34.0    1230.0  EURCHF
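One thing worth checking in a join like this: `reset_index` moves `Time` out of the index, so `join` then matches rows by integer position rather than by timestamp. The sketch below uses only the `Close` values from the EURUSD and AUDUSD samples above to show the difference. The variable names and suffixes are illustrative assumptions, not part of the original code.

```python
import pandas as pd

eur = pd.DataFrame(
    {"Close": [1.43153, 1.42886, 1.42705, 1.42939, 1.42848]},
    index=pd.to_datetime([
        "2010-01-04 00:00", "2010-01-04 01:00", "2010-01-04 02:00",
        "2010-01-04 03:00", "2010-01-04 05:00",   # 04:00 is missing
    ]).rename("Time"),
)
aud = pd.DataFrame(
    {"Close": [0.89711, 0.89632, 0.89500, 0.89613, 0.89633]},
    index=pd.to_datetime([
        "2010-01-04 00:00", "2010-01-04 01:00", "2010-01-04 02:00",
        "2010-01-04 04:00", "2010-01-04 05:00",   # 03:00 is missing
    ]).rename("Time"),
)

# Positional join: both frames now have a 0..4 integer index, so row 3 pairs
# the 03:00 EURUSD bar with the 04:00 AUDUSD bar.
positional = eur.reset_index().join(
    aud.reset_index(), lsuffix="_eur", rsuffix="_aud"
)
mismatched = positional[positional["Time_eur"] != positional["Time_aud"]]
print(mismatched)

# Joining on the Time index with how="outer" keeps every timestamp from either
# frame and leaves NaN where one side has no bar for that hour.
by_time = eur.join(aud, how="outer", lsuffix="_eur", rsuffix="_aud")
print(by_time)
```

With the index left in place, each row of `by_time` holds bars from a single timestamp, and the gaps are visible as NaN rather than being shifted silently.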

Then I forward-fill the NaN values:

table = table.ffill()  # same effect as fillna(method='ffill'), which recent pandas versions deprecate
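Forward-filling only fills rows that already exist, so an hour that is missing from every frame would still be absent from the index. A minimal sketch of one way to handle that, assuming the data is meant to be strictly hourly, is to reindex onto a complete hourly range first and then forward-fill. It uses the EURCHF sample values from above; the `Filled` marker column is a hypothetical addition for illustration.

```python
import pandas as pd

eurchf = pd.DataFrame(
    {
        "Close": [1.48334, 1.48250, 1.48294, 1.48239, 1.48181],
        "Symbol": "EURCHF",
    },
    index=pd.to_datetime([
        "2010-01-04 00:00", "2010-01-04 02:00", "2010-01-04 03:00",
        "2010-01-04 04:00", "2010-01-04 05:00",   # 01:00 is missing
    ]).rename("Time"),
)

# Complete hourly index from the first to the last observed timestamp.
full_index = pd.date_range(
    eurchf.index.min(), eurchf.index.max(), freq="h", name="Time"
)

# reindex inserts an all-NaN row for each missing hour; ffill then copies the
# previous bar's values into it.
filled = eurchf.reindex(full_index).ffill()

# Optional: remember which rows are synthetic rather than observed.
filled["Filled"] = ~full_index.isin(eurchf.index)

print(filled)
```

Keeping a marker such as `Filled` makes it possible to tell padded bars apart from real ones later on.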

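I want to make sure that all of the original data stays in the correct place, and that the time-series index is filled in with the hours that are missing from it, as shown in the Excel screenshot I posted.
If anything is unclear, I am happy to post more information to help explain.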
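A rough way to test the "original data stays in the correct place" requirement is to select the original timestamps back out of the aligned table and compare them with the source frame. The function name `original_rows_preserved` is hypothetical, and the check assumes the column dtypes are unchanged by the alignment (integer columns become float once NaN is introduced, which would make `equals` report a difference).

```python
import pandas as pd


def original_rows_preserved(original, aligned):
    """True if every row of `original` appears unchanged in `aligned`
    at the same timestamp (hypothetical helper for illustration)."""
    if not original.index.isin(aligned.index).all():
        return False
    return aligned.loc[original.index, original.columns].equals(original)


original = pd.DataFrame(
    {"Close": [1.48334, 1.48250, 1.48294], "Symbol": "EURCHF"},
    index=pd.to_datetime(
        ["2010-01-04 00:00", "2010-01-04 02:00", "2010-01-04 03:00"]
    ).rename("Time"),
)

full_index = pd.date_range(
    original.index.min(), original.index.max(), freq="h", name="Time"
)
aligned = original.reindex(full_index).ffill()
ok = original_rows_preserved(original, aligned)

# Simulate a misaligned table by overwriting one observed value.
corrupted = aligned.copy()
corrupted.loc[pd.Timestamp("2010-01-04 02:00"), "Close"] = 0.0
bad = original_rows_preserved(original, corrupted)

print(ok, bad)
```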

Best wishes

0 answers:

No answers