I have a large DataFrame in the following format:
timestamp | col1 | col2 ...
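I want to select rows that are at least x minutes apart, where x can be 5, 10, 30, and so on. The problem is that the timestamps aren't equally spaced, so I can't use a simple "take every n-th row" trick.
Example: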
timestamp | col1 | col2
'2019-01-15 17:52:29.955000', x, b
'2019-01-15 17:58:29.531000', x, b
'2019-01-16 03:21:48.255000', x, b
'2019-01-16 03:27:46.324000', x, b
'2019-01-16 03:33:09.984000', x, b
'2019-01-16 07:22:08.170000', x, b
'2019-01-16 07:28:27.406000', x, b
'2019-01-16 07:34:35.194000', x, b
If the interval = 10:
Result:
'2019-01-15 17:52:29.955000', x, b
'2019-01-16 03:21:48.255000', x, b
'2019-01-16 03:33:09.984000', x, b
'2019-01-16 07:22:08.170000', x, b
'2019-01-16 07:34:35.194000', x, b
If the interval = 30:
Result:
'2019-01-15 17:52:29.955000', x, b
'2019-01-16 03:21:48.255000', x, b
'2019-01-16 07:22:08.170000', x, b
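I could use a brute-force n^2 approach, but I'm sure there's a pandas way that I'm missing.
Thanks! :)
Edit: Just to clarify, this is not a duplicate of "Calculate time difference between Pandas Dataframe indices". I need to subset the DataFrame based on a given time interval.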
Answer 0 (score: 2)
As noted in the comments, it looks like you need a for loop. That isn't too bad, though, because the loop is O(n):
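    import pandas as pd

    def sampling(df, thresh):
        # thresh is anything pd.to_timedelta accepts, e.g. '10T' (10 minutes)
        thresh = pd.to_timedelta(thresh)
        # gap to the previous row; the first row has no predecessor, so use 0
        time_diff = df.timestamp.diff().fillna(pd.Timedelta(seconds=0))
        # always keep the first row
        ret = [df.index[0]]
        running_total = pd.to_timedelta(0)
        for i in df.index:
            # time elapsed since the last kept row
            running_total += time_diff.loc[i]
            if running_total >= thresh:
                ret.append(i)
                running_total = pd.to_timedelta(0)
        return df.loc[ret].copy()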
Then sampling(df, '10T') gives:
                    timestamp col1 col2
    0 2019-01-15 17:52:29.955    x    b
    2 2019-01-16 03:21:48.255    x    b
    4 2019-01-16 03:33:09.984    x    b
    5 2019-01-16 07:22:08.170    x    b
    7 2019-01-16 07:34:35.194    x    b
And sampling(df, '30T') gives:
                    timestamp col1 col2
    0 2019-01-15 17:52:29.955    x    b
    2 2019-01-16 03:21:48.255    x    b
    5 2019-01-16 07:22:08.170    x    b
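For anyone who wants to reproduce this, here is a minimal self-contained sketch. It rebuilds the example data from the question and uses an equivalent formulation: the running total of gaps since the last kept row equals the current timestamp minus the last kept timestamp, so the loop can compare against that directly. The function name `sample_min_gap` and its `col` parameter are my own for illustration, and the sketch assumes the timestamps are already sorted in ascending order. It passes '10min' rather than '10T' because, as far as I know, recent pandas versions deprecate the 'T' alias.

```python
import pandas as pd


def sample_min_gap(df, gap, col="timestamp"):
    """Keep the first row, then each row at least `gap` after the last kept row.

    Assumes df[col] is sorted in ascending order (hypothetical helper).
    """
    gap = pd.to_timedelta(gap)
    timestamps = pd.to_datetime(df[col])
    keep = []      # positional indices of the rows to keep
    last = None    # timestamp of the most recently kept row
    for pos, ts in enumerate(timestamps):
        if last is None or ts - last >= gap:
            keep.append(pos)
            last = ts
    return df.iloc[keep].copy()


# Example data copied from the question
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2019-01-15 17:52:29.955000",
        "2019-01-15 17:58:29.531000",
        "2019-01-16 03:21:48.255000",
        "2019-01-16 03:27:46.324000",
        "2019-01-16 03:33:09.984000",
        "2019-01-16 07:22:08.170000",
        "2019-01-16 07:28:27.406000",
        "2019-01-16 07:34:35.194000",
    ]),
    "col1": "x",
    "col2": "b",
})

print(sample_min_gap(df, "10min").index.tolist())  # [0, 2, 4, 5, 7]
print(sample_min_gap(df, "30min").index.tolist())  # [0, 2, 5]
```

Both calls select the same rows as the outputs shown above. The loop is still a single O(n) pass; it just avoids the intermediate diff column.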