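I have a particular requirement for time series analysis. Below I describe the requirement and a simple working solution, along with examples to help explain it.
I am looking for help improving this code by (a) reducing its time and space complexity, and (b) reducing the number of lines, if any of it can be done with built-in functions in pandas or other Python libraries.
Consider the following time series data: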
import pandas as pd
df1 = pd.DataFrame({'Date': {0: '2016-10-11', 1: '2016-10-11', 2: '2016-10-11', 3: '2016-10-11', 4: '2016-10-11',5: '2016-10-11'}, 'Stock': {0: 'ABC', 1: 'ABC', 2: 'ABC', 3: 'ABC', 4: 'ABC', 5: 'XYZ'}, 'StartTime': {0: '08:00:00.241', 1: '08:00:00.243', 2: '12:34:23.563', 3: '08:14:05.908', 4: '18:54:50.100', 5: '10:08:36.657'}, 'EndTime': {0: '09:13:46.867',1: '10:06:26.452', 2: '12:34:23.569', 3: '11:24:23.533', 4: '18:55:23.903', 5: '14:51:08.756'}})
df2 = pd.DataFrame({'Date': {0: '2016-10-11', 1: '2016-10-11', 2: '2016-10-11', 3: '2016-10-11', 4: '2016-10-11'}, 'Stock': {0: 'ABC', 1: 'ABC', 2: 'ABC', 3: 'ABC', 4: 'ABC'}, 'Volume': {0: 100, 1: 300, 2: 600, 3: 1500, 4: 200}, 'Price': {0: 10.05, 1: 10.10, 2: 10.40, 3:10.50, 4: 10.45}, 'time': {0: '08:00:00.242', 1: '09:00:10.534', 2: '10:08:36.658', 3: '11:45:43.654', 4: '12:34:23.563'}})
print(df1)
print(df2)
         Date       EndTime     StartTime Stock
0  2016-10-11  09:13:46.867  08:00:00.241   ABC
1  2016-10-11  10:06:26.452  08:00:00.243   ABC
2  2016-10-11  12:34:23.569  12:34:23.563   ABC
3  2016-10-11  11:24:23.533  08:14:05.908   ABC
4  2016-10-11  18:55:23.903  18:54:50.100   ABC
5  2016-10-11  14:51:08.756  10:08:36.657   XYZ

         Date  Price Stock  Volume          time
0  2016-10-11  10.05   ABC     100  08:00:00.242
1  2016-10-11  10.10   ABC     300  09:00:10.534
2  2016-10-11  10.40   ABC     600  10:08:36.658
3  2016-10-11  10.50   ABC    1500  11:45:43.654
4  2016-10-11  10.45   ABC     200  12:34:23.563
I would like to write the following two functions in Python in the most efficient way. They take the following inputs:
def do_asof(df1, df2, left_time='StartTime', right_time='time', left_on=['Date','Stock'], right_on=['Date','Stock'], uptoCols=2)
def do_onafter(df1, df2, left_time='StartTime', right_time='time', left_on=['Date','Stock'], right_on=['Date','Stock'], uptoCols=2)
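The first function, do_asof, does the following:
1. Performs an exact match between df1 and df2 on the left_on and right_on columns. In this case that is an inner join on the 'Date' and 'Stock' columns: df3 = df1.merge(df2, on=['Date','Stock'])
2. Then, on df3:
(a) Drops all rows where 'time' > 'StartTime', with something like df3 = df3.loc[df3['time'] <= df3['StartTime']]
(b) Keeps at most uptoCols = 2 entries (the latest by 'time') for each row of the original df1, with something like df3 = df3.sort_values(['time'], ascending=[True]).groupby(['Date','Stock','StartTime','EndTime']).tail(2)
(c) Outer joins the result back to the original df1 to get the desired output: df4 = df1.merge(df3, on=['Date','Stock','StartTime','EndTime'], how='outer')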
print(df4)
         Date       EndTime     StartTime Stock  Price  Volume          time
0  2016-10-11  09:13:46.867  08:00:00.241   ABC    NaN     NaN           NaN
1  2016-10-11  10:06:26.452  08:00:00.243   ABC  10.05     100  08:00:00.242
2  2016-10-11  12:34:23.569  12:34:23.563   ABC  10.50    1500  11:45:43.654
3  2016-10-11  12:34:23.569  12:34:23.563   ABC  10.45     200  12:34:23.563
4  2016-10-11  11:24:23.533  08:14:05.908   ABC  10.05     100  08:00:00.242
5  2016-10-11  18:55:23.903  18:54:50.100   ABC  10.50    1500  11:45:43.654
6  2016-10-11  18:55:23.903  18:54:50.100   ABC  10.45     200  12:34:23.563
7  2016-10-11  14:51:08.756  10:08:36.657   XYZ    NaN     NaN           NaN
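
Assembled into one function, the do_asof steps might look like the sketch below. It is the straightforward version, not an optimized one, and it rests on a few assumptions: the rows of df1 are unique, the time columns are zero-padded HH:MM:SS.mmm strings (so string comparison matches chronological order), and df1 and df2 share no column names other than the join keys. It also uses how='left' instead of how='outer' in the last step. Every key in df3 comes from df1, so the resulting rows are the same, and a left join keeps df1's row order.

```python
import pandas as pd


def do_asof(df1, df2, left_time='StartTime', right_time='time',
            left_on=('Date', 'Stock'), right_on=('Date', 'Stock'), uptoCols=2):
    """For each df1 row, attach up to `uptoCols` of the latest df2 rows with
    df2[right_time] <= df1[left_time]; unmatched df1 rows get NaN."""
    df1_cols = list(df1.columns)
    # Step 1: exact match on the key columns (inner join).
    df3 = df1.merge(df2, left_on=list(left_on), right_on=list(right_on))
    # Step 2(a): drop df2 rows that come after the df1 row's time.
    df3 = df3.loc[df3[right_time] <= df3[left_time]]
    # Step 2(b): keep the last `uptoCols` matches per original df1 row.
    # A stable sort keeps tied times in their original relative order.
    df3 = (df3.sort_values(right_time, kind='mergesort')
              .groupby(df1_cols, sort=False)
              .tail(uptoCols))
    # Step 2(c): join back so df1 rows without a match survive with NaN.
    # Assumption: a left join stands in for the outer join described above.
    return df1.merge(df3, on=df1_cols, how='left')
```

Calling do_asof(df1, df2) on the sample frames is intended to reproduce the rows of the df4 shown above, though the column order may differ between pandas versions.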
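The second function, do_onafter, does the following:
1. Performs the same kind of exact match between df1 and df2 on the left_on and right_on columns. In this case that is an inner join on the 'Date' and 'Stock' columns: df3 = df1.merge(df2, on=['Date','Stock'])
2. Then, on df3:
(a) Drops all rows where 'time' < 'StartTime', with something like df3 = df3.loc[df3['time'] >= df3['StartTime']]
(b) Keeps at most uptoCols = 2 entries (the earliest by 'time') for each row of the original df1, with something like df3 = df3.sort_values(['time'], ascending=[True]).groupby(['Date','Stock','StartTime','EndTime']).head(2)
(c) Outer joins the result back to the original df1 to get the desired output: df4 = df1.merge(df3, on=['Date','Stock','StartTime','EndTime'], how='outer')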
print(df4)
         Date       EndTime     StartTime Stock  Price  Volume          time
0  2016-10-11  09:13:46.867  08:00:00.241   ABC  10.05     100  08:00:00.242
1  2016-10-11  09:13:46.867  08:00:00.241   ABC  10.10     300  09:00:10.534
2  2016-10-11  10:06:26.452  08:00:00.243   ABC  10.10     300  09:00:10.534
3  2016-10-11  10:06:26.452  08:00:00.243   ABC  10.40     600  10:08:36.658
4  2016-10-11  12:34:23.569  12:34:23.563   ABC  10.45     200  12:34:23.563
5  2016-10-11  11:24:23.533  08:14:05.908   ABC  10.10     300  09:00:10.534
6  2016-10-11  11:24:23.533  08:14:05.908   ABC  10.40     600  10:08:36.658
7  2016-10-11  18:55:23.903  18:54:50.100   ABC    NaN     NaN           NaN
8  2016-10-11  14:51:08.756  10:08:36.657   XYZ    NaN     NaN           NaN
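
Put together, the do_onafter steps might look like the following sketch. It is a direct transcription of the steps rather than an optimized version. It assumes that df1 has no duplicate rows, that the times are zero-padded HH:MM:SS.mmm strings (so they compare correctly as strings), and that the only column names df1 and df2 have in common are the join keys. The final step uses how='left' in place of how='outer': df3 only contains keys that already exist in df1, so no rows are lost, and the left join preserves the order of df1.

```python
import pandas as pd


def do_onafter(df1, df2, left_time='StartTime', right_time='time',
               left_on=('Date', 'Stock'), right_on=('Date', 'Stock'), uptoCols=2):
    """For each df1 row, attach up to `uptoCols` of the earliest df2 rows with
    df2[right_time] >= df1[left_time]; unmatched df1 rows get NaN."""
    df1_cols = list(df1.columns)
    # Step 1: exact match on the key columns (inner join).
    df3 = df1.merge(df2, left_on=list(left_on), right_on=list(right_on))
    # Step 2(a): drop df2 rows that come before the df1 row's time.
    df3 = df3.loc[df3[right_time] >= df3[left_time]]
    # Step 2(b): keep the first `uptoCols` matches per original df1 row.
    # A stable sort keeps tied times in their original relative order.
    df3 = (df3.sort_values(right_time, kind='mergesort')
              .groupby(df1_cols, sort=False)
              .head(uptoCols))
    # Step 2(c): join back so df1 rows without a match survive with NaN.
    # Assumption: a left join stands in for the outer join described above.
    return df1.merge(df3, on=df1_cols, how='left')
```

The only differences from the as-of variant are the direction of the comparison and the use of head instead of tail, so the two could share a single helper with a direction flag if that reads better.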