Cross join time-series datasets, applying criteria on time while limiting the number of rows

Time: 2016-10-16 17:33:58

Tags: python pandas join time-series

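I have a peculiar requirement for time-series analysis. Below I describe the requirement along with a simple working solution, and I have worked out examples to help explain it.

I am looking for help improving this code by (a) reducing its time and space complexity, and (b) reducing the number of lines, if any of it can be done with built-in functions in pandas or another Python library.

Consider the following time-series data: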

import pandas as pd

df1 = pd.DataFrame({'Date': {0: '2016-10-11', 1: '2016-10-11', 2: '2016-10-11', 3: '2016-10-11', 4: '2016-10-11',5: '2016-10-11'}, 'Stock': {0: 'ABC', 1: 'ABC', 2: 'ABC', 3: 'ABC', 4: 'ABC', 5: 'XYZ'}, 'StartTime': {0: '08:00:00.241', 1: '08:00:00.243', 2: '12:34:23.563', 3: '08:14:05.908', 4: '18:54:50.100', 5: '10:08:36.657'}, 'EndTime': {0: '09:13:46.867',1: '10:06:26.452', 2: '12:34:23.569', 3: '11:24:23.533', 4: '18:55:23.903', 5: '14:51:08.756'}})
df2 = pd.DataFrame({'Date': {0: '2016-10-11', 1: '2016-10-11', 2: '2016-10-11', 3: '2016-10-11', 4: '2016-10-11'}, 'Stock': {0: 'ABC', 1: 'ABC', 2: 'ABC', 3: 'ABC', 4: 'ABC'}, 'Volume': {0: 100, 1: 300, 2: 600, 3: 1500, 4: 200}, 'Price': {0: 10.05, 1: 10.10, 2: 10.40, 3:10.50, 4: 10.45}, 'time': {0: '08:00:00.242', 1: '09:00:10.534', 2: '10:08:36.658', 3: '11:45:43.654', 4: '12:34:23.563'}})

print(df1)
print(df2)

         Date       EndTime     StartTime Stock
0  2016-10-11  09:13:46.867  08:00:00.241   ABC
1  2016-10-11  10:06:26.452  08:00:00.243   ABC
2  2016-10-11  12:34:23.569  12:34:23.563   ABC
3  2016-10-11  11:24:23.533  08:14:05.908   ABC
4  2016-10-11  18:55:23.903  18:54:50.100   ABC
5  2016-10-11  14:51:08.756  10:08:36.657   XYZ

         Date   Price Stock  Volume          time
0  2016-10-11   10.05   ABC     100  08:00:00.242
1  2016-10-11   10.10   ABC     300  09:00:10.534
2  2016-10-11   10.40   ABC     600  10:08:36.658
3  2016-10-11   10.50   ABC    1500  11:45:43.654
4  2016-10-11   10.45   ABC     200  12:34:23.563
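
The time columns in both frames are plain strings, so comparing them compares text. That happens to match chronological order here only because every value uses the same zero-padded HH:MM:SS.mmm layout. A minimal sketch of parsing them into real timestamps is shown below, using small stand-in frames with the same column layout; the helper name add_timestamp and the output column names start_ts and trade_ts are made up for illustration.

```python
import pandas as pd

# Small stand-in frames with the same column layout as df1 and df2.
left = pd.DataFrame({
    'Date': ['2016-10-11', '2016-10-11'],
    'Stock': ['ABC', 'ABC'],
    'StartTime': ['08:00:00.241', '12:34:23.563'],
})
right = pd.DataFrame({
    'Date': ['2016-10-11', '2016-10-11'],
    'Stock': ['ABC', 'ABC'],
    'time': ['08:00:00.242', '09:00:10.534'],
})


def add_timestamp(df, date_col, time_col, out_col):
    """Return a copy of df with date_col and time_col parsed into one datetime column.

    Hypothetical helper, not part of pandas.
    """
    out = df.copy()
    out[out_col] = pd.to_datetime(
        out[date_col] + ' ' + out[time_col],
        format='%Y-%m-%d %H:%M:%S.%f',
    )
    return out


left = add_timestamp(left, 'Date', 'StartTime', 'start_ts')
right = add_timestamp(right, 'Date', 'time', 'trade_ts')
print(left.dtypes)
print(right.dtypes)
```

With datetime columns the same <= and >= filters keep working, and a malformed time string raises an error at parse time instead of silently sorting in the wrong place.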

I want to write the following two functions in Python as efficiently as possible. They take the following inputs:

def do_asof(df1, df2, left_time='StartTime', right_time='time', left_on=['Date','Stock'], right_on=['Date','Stock'], uptoCols=2)
def do_onafter(df1, df2, left_time='StartTime', right_time='time', left_on=['Date','Stock'], right_on=['Date','Stock'], uptoCols=2)

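The first function, do_asof, does the following:
    1. Performs an exact match between df1 and df2 on the left_on and right_on columns. In this case that is an inner join on the 'Date' and 'Stock' columns: df3 = df1.merge(df2, on=['Date','Stock'])
    2. Then, in df3:
        (a) Drops all rows where 'time' > 'StartTime', with something like df3 = df3.loc[df3['time'] <= df3['StartTime']]
        (b) Keeps at most uptoCols = 2 entries for each row of the original df1, with something like df3 = df3.sort_values(['time'], ascending=[True]).groupby(['Date','Stock','StartTime','EndTime']).tail(2)
        (c) Outer-joins back to the original df1 to get the desired output: df4 = df1.merge(df3, on=['Date','Stock','StartTime','EndTime'], how='outer')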

print(df4)

         Date       EndTime     StartTime Stock  Price  Volume          time
0  2016-10-11  09:13:46.867  08:00:00.241   ABC    NaN     NaN           NaN
1  2016-10-11  10:06:26.452  08:00:00.243   ABC  10.05     100  08:00:00.242
2  2016-10-11  12:34:23.569  12:34:23.563   ABC  10.50    1500  11:45:43.654
3  2016-10-11  12:34:23.569  12:34:23.563   ABC  10.45     200  12:34:23.563
4  2016-10-11  11:24:23.533  08:14:05.908   ABC  10.05     100  08:00:00.242
5  2016-10-11  18:55:23.903  18:54:50.100   ABC  10.50    1500  11:45:43.654
6  2016-10-11  18:55:23.903  18:54:50.100   ABC  10.45     200  12:34:23.563
7  2016-10-11  14:51:08.756  10:08:36.657   XYZ    NaN     NaN           NaN
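
The do_asof steps above can be put together as a single function. This is only a sketch under a few assumptions: the time columns are fixed-width strings, so string order equals time order; df1 and df2 share no column names other than the join keys; and a helper column named _row (invented here) tags each df1 row, so grouping does not depend on the df1 rows being distinct. It also uses a left join from df1 in place of the outer join in step (c), which yields the same rows because every row of df3 comes from df1, and keeps the df1 row order.

```python
import pandas as pd

# Sample data from the question, written column-wise for brevity.
df1 = pd.DataFrame({
    'Date': ['2016-10-11'] * 6,
    'Stock': ['ABC'] * 5 + ['XYZ'],
    'StartTime': ['08:00:00.241', '08:00:00.243', '12:34:23.563',
                  '08:14:05.908', '18:54:50.100', '10:08:36.657'],
    'EndTime': ['09:13:46.867', '10:06:26.452', '12:34:23.569',
                '11:24:23.533', '18:55:23.903', '14:51:08.756'],
})
df2 = pd.DataFrame({
    'Date': ['2016-10-11'] * 5,
    'Stock': ['ABC'] * 5,
    'Volume': [100, 300, 600, 1500, 200],
    'Price': [10.05, 10.10, 10.40, 10.50, 10.45],
    'time': ['08:00:00.242', '09:00:10.534', '10:08:36.658',
             '11:45:43.654', '12:34:23.563'],
})


def do_asof(df1, df2, left_time='StartTime', right_time='time',
            left_on=('Date', 'Stock'), right_on=('Date', 'Stock'), uptoCols=2):
    left_on, right_on = list(left_on), list(right_on)

    # Tag every df1 row with a positional id ('_row' is a made-up helper name).
    left = df1.reset_index(drop=True)
    left['_row'] = left.index

    # Step 1: exact match on the key columns (inner join).
    df3 = left.merge(df2, left_on=left_on, right_on=right_on)

    # Step 2(a): keep only df2 rows at or before the df1 row's time.
    df3 = df3.loc[df3[right_time] <= df3[left_time]]

    # Step 2(b): keep the latest uptoCols matches for each df1 row.
    df3 = df3.sort_values(right_time).groupby('_row').tail(uptoCols)

    # Step 2(c): join back so unmatched df1 rows survive with NaN values.
    extra = [c for c in df2.columns if c not in right_on]
    out = left.merge(df3[['_row'] + extra], on='_row', how='left')
    out = out.sort_values(['_row', right_time]).reset_index(drop=True)
    return out.drop(columns='_row')


df4 = do_asof(df1, df2)
print(df4)
```

The inner join in step 1 still materializes every df1/df2 pair within a (Date, Stock) group, so this sketch does not reduce the time or space complexity; it only packages the steps.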

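The second function, do_onafter, does the following:
    1. Performs the same exact match between df1 and df2 on the left_on and right_on columns. In this case that is an inner join on the 'Date' and 'Stock' columns: df3 = df1.merge(df2, on=['Date','Stock'])
    2. Then, in df3:
        (a) Drops all rows where 'time' < 'StartTime', with something like df3 = df3.loc[df3['time'] >= df3['StartTime']]
        (b) Keeps at most uptoCols = 2 entries for each row of the original df1, with something like df3 = df3.sort_values(['time'], ascending=[True]).groupby(['Date','Stock','StartTime','EndTime']).head(2)
        (c) Outer-joins back to the original df1 to get the desired output: df4 = df1.merge(df3, on=['Date','Stock','StartTime','EndTime'], how='outer')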

print(df4)

         Date       EndTime     StartTime Stock  Price  Volume          time
0  2016-10-11  09:13:46.867  08:00:00.241   ABC  10.05     100  08:00:00.242
1  2016-10-11  09:13:46.867  08:00:00.241   ABC  10.10     300  09:00:10.534
2  2016-10-11  10:06:26.452  08:00:00.243   ABC  10.10     300  09:00:10.534
3  2016-10-11  10:06:26.452  08:00:00.243   ABC  10.40     600  10:08:36.658
4  2016-10-11  12:34:23.569  12:34:23.563   ABC  10.45     200  12:34:23.563
5  2016-10-11  11:24:23.533  08:14:05.908   ABC  10.10     300  09:00:10.534
6  2016-10-11  11:24:23.533  08:14:05.908   ABC  10.40     600  10:08:36.658
7  2016-10-11  18:55:23.903  18:54:50.100   ABC    NaN     NaN           NaN
8  2016-10-11  14:51:08.756  10:08:36.657   XYZ    NaN     NaN           NaN
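
The do_onafter steps differ from do_asof only in the comparison (>= instead of <=) and in keeping the earliest matches (head instead of tail). The sketch below makes the same assumptions as before: fixed-width time strings, no shared column names beyond the join keys, a made-up _row helper column to identify df1 rows, and a left join from df1 in place of the outer join.

```python
import pandas as pd

# Sample data from the question, written column-wise for brevity.
df1 = pd.DataFrame({
    'Date': ['2016-10-11'] * 6,
    'Stock': ['ABC'] * 5 + ['XYZ'],
    'StartTime': ['08:00:00.241', '08:00:00.243', '12:34:23.563',
                  '08:14:05.908', '18:54:50.100', '10:08:36.657'],
    'EndTime': ['09:13:46.867', '10:06:26.452', '12:34:23.569',
                '11:24:23.533', '18:55:23.903', '14:51:08.756'],
})
df2 = pd.DataFrame({
    'Date': ['2016-10-11'] * 5,
    'Stock': ['ABC'] * 5,
    'Volume': [100, 300, 600, 1500, 200],
    'Price': [10.05, 10.10, 10.40, 10.50, 10.45],
    'time': ['08:00:00.242', '09:00:10.534', '10:08:36.658',
             '11:45:43.654', '12:34:23.563'],
})


def do_onafter(df1, df2, left_time='StartTime', right_time='time',
               left_on=('Date', 'Stock'), right_on=('Date', 'Stock'), uptoCols=2):
    left_on, right_on = list(left_on), list(right_on)

    # Tag every df1 row with a positional id ('_row' is a made-up helper name).
    left = df1.reset_index(drop=True)
    left['_row'] = left.index

    # Step 1: exact match on the key columns (inner join).
    df3 = left.merge(df2, left_on=left_on, right_on=right_on)

    # Step 2(a): keep only df2 rows at or after the df1 row's time.
    df3 = df3.loc[df3[right_time] >= df3[left_time]]

    # Step 2(b): keep the earliest uptoCols matches for each df1 row.
    df3 = df3.sort_values(right_time).groupby('_row').head(uptoCols)

    # Step 2(c): join back so unmatched df1 rows survive with NaN values.
    extra = [c for c in df2.columns if c not in right_on]
    out = left.merge(df3[['_row'] + extra], on='_row', how='left')
    out = out.sort_values(['_row', right_time]).reset_index(drop=True)
    return out.drop(columns='_row')


df4 = do_onafter(df1, df2)
print(df4)
```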

0 answers:

No answers yet.