Question

I have a large DataFrame in the following format:

timestamp | col1 | col2 ...

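I want to select rows that are at least x minutes apart, where x can be 5, 10, 30, and so on. The problem is that the timestamps aren't evenly spaced, so I can't use a simple "take every nth row" trick.

Example: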

timestamp | col1 | col2

'2019-01-15 17:52:29.955000', x, b
'2019-01-15 17:58:29.531000', x, b
'2019-01-16 03:21:48.255000', x, b
'2019-01-16 03:27:46.324000', x, b
'2019-01-16 03:33:09.984000', x, b
'2019-01-16 07:22:08.170000', x, b
'2019-01-16 07:28:27.406000', x, b
'2019-01-16 07:34:35.194000', x, b
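
To make the sample reproducible, the rows above could be loaded into a DataFrame roughly as follows. This is only a sketch: it assumes the placeholder values x and b are plain strings and that the timestamp column should be parsed into a datetime dtype.

```python
import pandas as pd

# Sample timestamps copied from the question.
timestamps = [
    "2019-01-15 17:52:29.955000",
    "2019-01-15 17:58:29.531000",
    "2019-01-16 03:21:48.255000",
    "2019-01-16 03:27:46.324000",
    "2019-01-16 03:33:09.984000",
    "2019-01-16 07:22:08.170000",
    "2019-01-16 07:28:27.406000",
    "2019-01-16 07:34:35.194000",
]

# 'x' and 'b' are kept as literal placeholder strings (an assumption).
df = pd.DataFrame({
    "timestamp": pd.to_datetime(timestamps),
    "col1": ["x"] * len(timestamps),
    "col2": ["b"] * len(timestamps),
})

print(df.dtypes)
```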

If interval = 10:

Result:

'2019-01-15 17:52:29.955000', x, b
'2019-01-16 03:21:48.255000', x, b
'2019-01-16 03:33:09.984000', x, b
'2019-01-16 07:22:08.170000', x, b
'2019-01-16 07:34:35.194000', x, b

If interval = 30:

Result:

'2019-01-15 17:52:29.955000', x, b
'2019-01-16 03:21:48.255000', x, b
'2019-01-16 07:22:08.170000', x, b

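I could use a brute-force O(n^2) approach, but I'm sure there is a pandas way that I'm missing.

Thanks! :)

Edit: just to clarify, this is not a duplicate of "Calculate time difference between Pandas Dataframe indices". I need to subset the DataFrame based on a given time interval.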

Answer 1

As mentioned in the comments, it looks like you need a for loop. That is not so bad, because the loop is O(n):

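import pandas as pd

def sampling(df, thresh):
    thresh = pd.to_timedelta(thresh)
    # Gap to the previous row; the first row has no predecessor, so use 0.
    time_diff = df.timestamp.diff().fillna(pd.Timedelta(seconds=0))
    ret = [df.index[0]]  # always keep the first row, whatever its label
    running_total = pd.to_timedelta(0)
    for i, delta in time_diff.items():
        running_total += delta
        # Keep the row once enough time has accumulated, then restart the count.
        if running_total >= thresh:
            ret.append(i)
            running_total = pd.to_timedelta(0)

    return df.loc[ret].copy()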

Then sampling(df, '10min') gives

                timestamp col1 col2
0 2019-01-15 17:52:29.955    x    b
2 2019-01-16 03:21:48.255    x    b
4 2019-01-16 03:33:09.984    x    b
5 2019-01-16 07:22:08.170    x    b
7 2019-01-16 07:34:35.194    x    b

and sampling(df, '30min') gives:

                timestamp col1 col2
0 2019-01-15 17:52:29.955    x    b
2 2019-01-16 03:21:48.255    x    b
5 2019-01-16 07:22:08.170    x    b
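
Because the running total is reset every time a row is kept, it always equals the time elapsed since the last kept row. The sketch below spells out that equivalent formulation and contrasts it with a filter on consecutive differences alone. The names sample_by_gap and naive are hypothetical and not part of pandas. The sketch assumes the timestamp column is datetime-like and sorted in ascending order.

```python
import pandas as pd

def sample_by_gap(df, thresh, col="timestamp"):
    """Keep the first row, then every row at least `thresh` after the last kept row.

    Hypothetical helper; assumes df[col] is datetime-like and sorted ascending.
    """
    if df.empty:
        return df.copy()
    thresh = pd.to_timedelta(thresh)
    times = df[col].tolist()   # list of pd.Timestamp objects
    keep = [0]                 # integer positions, not index labels
    last = times[0]
    for pos in range(1, len(times)):
        if times[pos] - last >= thresh:
            keep.append(pos)
            last = times[pos]
    return df.iloc[keep].copy()

# Sample data from the question.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2019-01-15 17:52:29.955", "2019-01-15 17:58:29.531",
        "2019-01-16 03:21:48.255", "2019-01-16 03:27:46.324",
        "2019-01-16 03:33:09.984", "2019-01-16 07:22:08.170",
        "2019-01-16 07:28:27.406", "2019-01-16 07:34:35.194",
    ]),
    "col1": ["x"] * 8,
    "col2": ["b"] * 8,
})

kept = sample_by_gap(df, "10min")

# A filter on consecutive differences only compares neighbouring rows.
gaps = df["timestamp"].diff()
naive = df[gaps.isna() | (gaps >= pd.Timedelta("10min"))]

print(kept.index.tolist())
print(naive.index.tolist())
```

On the sample data the consecutive-difference filter keeps only rows 0, 2 and 5. It misses rows 4 and 7, which are each less than 10 minutes after their immediate predecessor but more than 10 minutes after the last kept row. That dependence on the previously kept row is why a sequential pass is used here.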

Pandas: select rows with a given timestamp interval
