Pandas: select rows separated by at least a given timestamp interval

Time: 2019-07-10 17:49:40

Tags: python pandas

I have a large dataframe in the following format:

timestamp | col1 | col2 ...

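I want to select rows that are at least x minutes apart, where x can be 5, 10, 30, and so on. The problem is that the timestamps aren't equally spaced, so I can't use a simple "take every nth row" trick.

Example: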

timestamp | col1 | col2

'2019-01-15 17:52:29.955000', x, b
'2019-01-15 17:58:29.531000', x, b
'2019-01-16 03:21:48.255000', x, b
'2019-01-16 03:27:46.324000', x, b
'2019-01-16 03:33:09.984000', x, b
'2019-01-16 07:22:08.170000', x, b
'2019-01-16 07:28:27.406000', x, b
'2019-01-16 07:34:35.194000', x, b
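
For anyone who wants to reproduce this, the sample above could be built roughly as follows. This is only a sketch: it assumes the placeholder values x and b are literal strings, and the variable names `timestamps` and `df` are chosen for illustration.

```python
import pandas as pd

# Timestamps copied from the example table above.
timestamps = [
    "2019-01-15 17:52:29.955000",
    "2019-01-15 17:58:29.531000",
    "2019-01-16 03:21:48.255000",
    "2019-01-16 03:27:46.324000",
    "2019-01-16 03:33:09.984000",
    "2019-01-16 07:22:08.170000",
    "2019-01-16 07:28:27.406000",
    "2019-01-16 07:34:35.194000",
]

df = pd.DataFrame({"timestamp": pd.to_datetime(timestamps)})
# Assumption: the "x" and "b" placeholders are stored as plain strings.
df["col1"] = "x"
df["col2"] = "b"

# Gaps between consecutive rows; they range from minutes to hours,
# which is why slicing every nth row does not give a fixed time spacing.
print(df["timestamp"].diff())
```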

If interval = 10:

Result:

'2019-01-15 17:52:29.955000', x, b
'2019-01-16 03:21:48.255000', x, b
'2019-01-16 03:33:09.984000', x, b
'2019-01-16 07:22:08.170000', x, b
'2019-01-16 07:34:35.194000', x, b

If interval = 30:

Result:

'2019-01-15 17:52:29.955000', x, b
'2019-01-16 03:21:48.255000', x, b
'2019-01-16 07:22:08.170000', x, b

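I could use a brute-force O(n^2) approach, but I'm sure there is a pandas way that I'm missing.

Thanks! :)

Edit: just to clarify, this is not a duplicate of "Calculate time difference between Pandas Dataframe indices". I need to subset the dataframe based on a given time interval.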

1 answer:

Answer 0 (score: 2)

As noted in the comments, it looks like you need a for loop. That isn't so bad, because it is a single O(n) pass over the rows:

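import pandas as pd

def sampling(df, thresh):
    # Parse the threshold string into a Timedelta.
    thresh = pd.to_timedelta(thresh)
    # Gap to the previous row; the first row's NaT gap is filled with 0.
    time_diff = df.timestamp.diff().fillna(pd.Timedelta(seconds=0))
    # Always keep the first row, by label, so the index need not start at 0.
    ret = [df.index[0]]
    # Time elapsed since the last kept row (timestamps assumed sorted).
    running_total = pd.to_timedelta(0)
    for i in df.index[1:]:
        running_total += time_diff[i]
        if running_total >= thresh:
            ret.append(i)
            running_total = pd.to_timedelta(0)

    return df.loc[ret].copy()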

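Then sampling(df, '10T') gives the following ('10T' means 10 minutes; pandas 2.2 and later deprecate the 'T' alias, and '10min' is the equivalent spelling):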

                timestamp col1 col2
0 2019-01-15 17:52:29.955    x    b
2 2019-01-16 03:21:48.255    x    b
4 2019-01-16 03:33:09.984    x    b
5 2019-01-16 07:22:08.170    x    b
7 2019-01-16 07:34:35.194    x    b

sampling(df, '30T') gives:

                timestamp col1 col2
0 2019-01-15 17:52:29.955    x    b
2 2019-01-16 03:21:48.255    x    b
5 2019-01-16 07:22:08.170    x    b
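
The same greedy rule can also be written by comparing each timestamp directly with the last kept one, which avoids per-row Series lookups. The following is a minimal sketch rather than a tuned implementation: the function name `sampling_by_last_kept` and the `column` parameter are made up for illustration, and it assumes the timestamp column is a timezone-naive datetime64 column sorted in ascending order.

```python
import pandas as pd


def sampling_by_last_kept(df, thresh, column="timestamp"):
    """Keep a row only if it is at least `thresh` after the last kept row.

    Assumes df[column] is a datetime64 column sorted in ascending order.
    """
    if df.empty:
        return df.copy()

    # Convert the threshold to a NumPy timedelta64 for cheap comparisons.
    thresh = pd.to_timedelta(thresh).to_timedelta64()
    times = df[column].to_numpy()  # plain datetime64 array

    keep = [0]  # positions (not index labels) of the rows to keep
    last_kept = times[0]
    for pos in range(1, len(times)):
        if times[pos] - last_kept >= thresh:
            keep.append(pos)
            last_kept = times[pos]

    # Positional selection, so any index (non-unique, non-integer) works.
    return df.iloc[keep].copy()
```

On the example data, sampling_by_last_kept(df, '10min') and sampling_by_last_kept(df, '30min') should select the same rows as the two outputs above. It is still a single O(n) pass, because each decision depends on which row was kept last.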