I have the following DataFrame:
    id            timestamp
0    1  2020-09-01 15:14:35
1    1  2020-09-01 15:15:40
2    1  2020-09-01 15:16:59
3    1  2020-09-01 15:24:42
4    1  2020-09-01 15:25:50
5    1  2020-09-01 15:26:40
6    2  2020-09-01 18:14:35
7    2  2020-09-01 18:17:39
8    2  2020-09-01 18:24:40
9    2  2020-09-01 18:24:42
10   2  2020-09-01 18:34:40
11   2  2020-09-01 18:35:40
12   2  2020-09-01 18:36:40
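For anyone who wants to reproduce this, here is a minimal sketch that builds the frame above. The variable name `df` is just a convenient choice, and the timestamps are parsed with `pd.to_datetime` so that time-based grouping works on them.

```python
import pandas as pd

# Rebuild the sample data: 6 requests for id 1, 7 requests for id 2.
df = pd.DataFrame({
    "id": [1] * 6 + [2] * 7,
    "timestamp": pd.to_datetime([
        "2020-09-01 15:14:35", "2020-09-01 15:15:40", "2020-09-01 15:16:59",
        "2020-09-01 15:24:42", "2020-09-01 15:25:50", "2020-09-01 15:26:40",
        "2020-09-01 18:14:35", "2020-09-01 18:17:39", "2020-09-01 18:24:40",
        "2020-09-01 18:24:42", "2020-09-01 18:34:40", "2020-09-01 18:35:40",
        "2020-09-01 18:36:40",
    ]),
})
print(df)
```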
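Each id is a server endpoint, and each timestamp is the time of a single request. [Figure: timeline plot of the requests for each id]
I want to count the number of load periods for each server, where I define a load period as:
At least 3 requests in a row with less than 5 minutes between them.
So server 1 has 2 loads, while server 2 has only 1 load. I would like the output to look like this: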
    id            timestamp  loads_detected
0    1  2020-09-01 15:14:35               0
1    1  2020-09-01 15:15:40               0
2    1  2020-09-01 15:16:59               1  <-- 3 requests in a row less than 5 minutes apart
3    1  2020-09-01 15:24:42               1  <-- this request is more than 5 minutes after the previous one
4    1  2020-09-01 15:25:50               1
5    1  2020-09-01 15:26:40               2  <-- 3 requests in a row less than 5 minutes apart
6    2  2020-09-01 18:14:35               0
7    2  2020-09-01 18:17:39               0  <-- only 2 requests less than 5 minutes apart, counter not increased
8    2  2020-09-01 18:24:40               0
9    2  2020-09-01 18:24:42               0
10   2  2020-09-01 18:34:40               0
11   2  2020-09-01 18:35:40               0
12   2  2020-09-01 18:36:40               1  <-- 3 requests in a row less than 5 minutes apart
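Any help would be greatly appreciated :)
Answer 0 (score: 1)
IIUC, you can group by id and by 5-minute bins, flag the third request that falls into each bin, and then take the cumulative sum of those flags per id: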
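import pandas as pd
# Flag the third request (cumcount == 2) in each 5-minute bin per id; origin='start' anchors the bins at the first timestamp.
df['loads_detected'] = df.groupby(['id', pd.Grouper(key="timestamp", freq='5min', origin='start')]).cumcount().eq(2).astype(int)
# Running total of the flags within each id (select the column so that timestamp is not summed).
df['loads_detected'] = df.groupby('id')['loads_detected'].cumsum()
print(df)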
Output
    id            timestamp  loads_detected
0    1  2020-09-01 15:14:35               0
1    1  2020-09-01 15:15:40               0
2    1  2020-09-01 15:16:59               1
3    1  2020-09-01 15:24:42               1
4    1  2020-09-01 15:25:50               1
5    1  2020-09-01 15:26:40               2
6    2  2020-09-01 18:14:35               0
7    2  2020-09-01 18:17:39               0
8    2  2020-09-01 18:24:40               0
9    2  2020-09-01 18:24:42               0
10   2  2020-09-01 18:34:40               0
11   2  2020-09-01 18:35:40               0
12   2  2020-09-01 18:36:40               1
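One caveat: fixed 5-minute bins can split a burst of requests that straddles a bin edge. If "in a row" is read as "each request arrives less than 5 minutes after the previous one for the same id", a gap-based variant may fit better. The following is only a sketch under that assumption (my interpretation, not something the question states). It counts one load per run of close requests, at the moment the run reaches its third request, so a longer run still counts once. The names `gap`, `new_run`, `run_id` and `third_in_run` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1] * 6 + [2] * 7,
    "timestamp": pd.to_datetime([
        "2020-09-01 15:14:35", "2020-09-01 15:15:40", "2020-09-01 15:16:59",
        "2020-09-01 15:24:42", "2020-09-01 15:25:50", "2020-09-01 15:26:40",
        "2020-09-01 18:14:35", "2020-09-01 18:17:39", "2020-09-01 18:24:40",
        "2020-09-01 18:24:42", "2020-09-01 18:34:40", "2020-09-01 18:35:40",
        "2020-09-01 18:36:40",
    ]),
})

# The logic relies on rows being ordered by id, then by time.
df = df.sort_values(["id", "timestamp"]).reset_index(drop=True)

# Time since the previous request of the same id (NaT for the first one).
gap = df.groupby("id")["timestamp"].diff()

# A new run starts at the first request of an id or after a gap of 5 minutes or more.
new_run = gap.isna() | (gap >= pd.Timedelta(minutes=5))
run_id = new_run.cumsum().rename("run_id")

# Mark the third request of each run (0-based position 2).
third_in_run = df.groupby(run_id).cumcount().eq(2).astype(int)

# Running count of detected loads per id.
df["loads_detected"] = third_in_run.groupby(df["id"]).cumsum()
print(df)
```

On the sample data this gives the same `loads_detected` column as the output shown above; the two approaches can differ on data where a burst crosses a bin boundary.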