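Pandas Python - Finding uncovered periods in a time series

Time: 2016-08-19 00:37:10

Tags: python pandas time-series

Hoping someone can help me with this, as I don't even know where to start.

Given a data frame containing a series of start and end times, for example: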

Order   Start Time              End Time
1       2016-08-18 09:30:00.000 2016-08-18 09:30:05.000
1       2016-08-18 09:30:00.005 2016-08-18 09:30:25.001
1       2016-08-18 09:30:30.001 2016-08-18 09:30:56.002
1       2016-08-18 09:30:40.003 2016-08-18 09:31:05.003
1       2016-08-18 11:30:45.000 2016-08-18 13:31:05.000

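For each order ID, I want to find the list of time periods, between the earliest start time and the latest end time, that are not covered by any range.

So in the example above, I would be looking for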

2016-08-18 09:30:05.000 to 2016-08-18 09:30:00.005 (the time lag between the first and second rows)
2016-08-18 09:30:25.001 to 2016-08-18 09:30:30.001 (the time lag between the second and third rows)
2016-08-18 09:31:05.003 to 2016-08-18 11:30:45.000 (the time period between the fourth and fifth rows)

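Rows 3 and 4 overlap, so no gap is counted between them.

Some things to consider (additional color):

Each record represents an outstanding order on (for example) one of the stock exchanges. So I can have orders open on NASDAQ and NYSE at the same time. I can also have a short-duration order on NASDAQ and a long-duration order on NYSE that start at the same time.

This would look as follows: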

Order   Start Time              End Time
1       2016-08-18 09:30:00.000 2016-08-18 09:30:05.000  (NYSE)
1       2016-08-18 09:30:00.001 2016-08-18 09:30:00.002  (NASDAQ)

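I am trying to figure out when nothing is happening, that is, when I have no live orders on any exchange.

I have no idea where to start on this... any ideas would be appreciated.

1 Answer:

Answer 0: (score: 1)

Setup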

from io import StringIO
import pandas as pd

text = """Order   Start Time               End Time
1       2016-08-18 09:30:00.000  2016-08-18 09:30:05.000
1       2016-08-18 09:30:00.005  2016-08-18 09:30:25.001
1       2016-08-18 09:30:30.001  2016-08-18 09:30:56.002
1       2016-08-18 09:30:40.003  2016-08-18 09:31:05.003
1       2016-08-18 11:30:45.000  2016-08-18 13:31:05.000
2       2016-08-18 09:30:00.000  2016-08-18 09:30:05.000
2       2016-08-18 09:30:00.005  2016-08-18 09:30:25.001
2       2016-08-18 09:30:30.001  2016-08-18 09:30:56.002
2       2016-08-18 09:30:40.003  2016-08-18 09:31:05.003
2       2016-08-18 11:30:45.000  2016-08-18 13:31:05.000"""

df = pd.read_csv(StringIO(text), sep=r'\s{2,}', engine='python', parse_dates=[1, 2])

Solution

def find_gaps(df, start_text='Start Time', end_text='End Time'):
    # rearrange stuff to get all times and a tracker
    # in single columns.
    cols = [start_text, end_text]
    df = df.reset_index()
    df1 = df[cols].stack().reset_index(-1)
    df1.columns = ['edge', 'time']
    df1['edge'] = df1['edge'].eq(start_text).mul(2).sub(1)

    # sort by ascending time, then descending edge
    # (starts before ends if equal time)
    # this will ensure we avoid zero length gaps.
    df1 = df1.sort_values(['time', 'edge'], ascending=[True, False])

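    # we identify gaps when we've reached a number
    # of ends equal to number of starts.
    # we'll track that with cumsum, when cumsum is
    # zero, coverage has stopped.
    track = df1['edge'].cumsum()

    # coverage resumes at the next edge in sorted order.
    # the last position is always zero but has no next
    # edge (NaT), so it is not a gap.
    next_time = df1['time'].shift(-1)
    is_gap = (track.eq(0) & next_time.notna()).to_numpy()

    # select the rows whose end closes coverage;
    # copy before assigning new column values.
    gaps = df.loc[df1.index[is_gap]].copy()
    gaps[start_text] = df1['time'].to_numpy()[is_gap]
    gaps[end_text] = next_time.to_numpy()[is_gap]

    return gaps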

df.set_index('Order').groupby(level=0).apply(find_gaps)
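
As a cross-check on the edge-counting idea, here is a minimal alternative sketch that is not part of the original answer. It sorts each order's intervals by start time, tracks the latest end time seen so far with `cummax`, and reports a gap wherever the next start falls after that running maximum. The helper name `find_gaps_cummax` is hypothetical, and the third row of the sample is made-up data added only so that the nested NYSE/NASDAQ case from the question is followed by a gap.

```python
from io import StringIO
import pandas as pd


def find_gaps_cummax(df, start_text='Start Time', end_text='End Time'):
    """Return uncovered periods for one order (illustrative helper)."""
    # Sort by start time so coverage can be scanned from left to right.
    s = df.sort_values([start_text, end_text])
    # Latest end time seen so far, i.e. how far coverage extends.
    covered_until = s[end_text].cummax()
    # Start time of the next interval in sorted order (NaT for the last row).
    next_start = s[start_text].shift(-1)
    # A gap exists where the next interval starts after all earlier ones ended.
    is_gap = (next_start > covered_until).to_numpy()
    return pd.DataFrame({
        start_text: covered_until.to_numpy()[is_gap],
        end_text: next_start.to_numpy()[is_gap],
    })


# First two rows are the nested example from the question;
# the third row is hypothetical data added to create a gap.
text = """Order,Start Time,End Time
1,2016-08-18 09:30:00.000,2016-08-18 09:30:05.000
1,2016-08-18 09:30:00.001,2016-08-18 09:30:00.002
1,2016-08-18 09:30:10.000,2016-08-18 09:30:20.000"""

sample = pd.read_csv(StringIO(text), parse_dates=['Start Time', 'End Time'])
gaps = sample.groupby('Order')[['Start Time', 'End Time']].apply(find_gaps_cummax)
print(gaps)
```

In this sample the short order nested inside the long one does not produce a gap, because the running maximum of the end times stays at the long order's end until a later interval starts beyond it.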
