Question

I have a pandas DataFrame, for example:

id         time
1             1
2             3
3             4
4             5
5             8
6             8

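I want to remove rows that are less than 2 seconds apart. I first compute the time difference between consecutive rows and add it as a column: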

df['time_since_last_detect'] = df.time.diff().fillna(0)

This gives:

id         time       time_since_last_detect
1             1                            0
2             3                            2
3             4                            1
4             5                            1
5             8                            3
6             8                            0

I then filter the rows with df[df.time_since_last_detect > 1], which results in:

id         time       time_since_last_detect
2             3                            2
5             8                            3

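The problem, however, is that once a row is removed, the difference to the new previous row is not recomputed. For example, after removing the first and third rows, the difference between the second and fourth rows would be 2, yet this filter still removes the fourth row, which I don't want. What is the best way to solve this? This is the expected result I want to achieve: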

id         time       time_since_last_detect
2             3                            2
4             5                            1
5             8                            3

Answer 1

This is not a perfect solution, but it may work for your case. The code below would need to be modified to turn it into a general-purpose function.

import pandas as pd

d = {'id': [1, 2, 3, 4, 5, 6], 'time': [1, 3, 4, 5, 8, 8]}
df = pd.DataFrame(data=d)

df['time_since_last_detect'] = df.time.diff().fillna(0)
timeperiod = 2

# Rolling sum of the differences over the last `timeperiod` rows (2 here; change as needed)
df['time_since_last_sum'] = df['time_since_last_detect'].rolling(min_periods=1, window=timeperiod).sum().fillna(0)

# Keep a row if its own difference is >= 2 OR the rolling sum over the last 2 rows equals 2
df_final = df.loc[(df['time_since_last_detect'] >= 2) | (df['time_since_last_sum'] == 2)]

Output:

   id  time  time_since_last_detect  time_since_last_sum
   2     3                     2.0                  2.0
   4     5                     1.0                  2.0
   5     8                     3.0                  4.0
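
The rolling-sum filter above happens to match this sample data, but it does not really re-measure each row against the last row that was kept. A more general approach might be a plain loop that tracks the time of the last kept row. The sketch below is only one possible way to do it: `filter_min_gap` and its parameters `min_gap`, `time_col`, and `keep_first` are hypothetical names, not part of pandas, and it assumes the rows are already sorted by time. It also assumes the first row acts as the initial reference but is dropped by default, which mirrors the `fillna(0)` behaviour in the question.

```python
import pandas as pd


def filter_min_gap(df, min_gap=2, time_col="time", keep_first=False):
    """Hypothetical helper: keep rows at least `min_gap` after the last reference row."""
    times = df[time_col].tolist()
    if not times:
        return df.copy()

    keep = [False] * len(times)
    keep[0] = keep_first   # assumption: the first row is dropped unless asked otherwise
    last_ref = times[0]    # the first row is still the initial reference point

    for i in range(1, len(times)):
        # Compare against the last kept (reference) row, not the previous raw row
        if times[i] - last_ref >= min_gap:
            keep[i] = True
            last_ref = times[i]

    return df.loc[keep]


df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6], "time": [1, 3, 4, 5, 8, 8]})
result = filter_min_gap(df, min_gap=2)
print(result)
```

On the sample data this keeps ids 2, 4 and 5. Passing `keep_first=True` would also keep id 1, if that is the behaviour you want. The loop runs in Python rather than being vectorised, because each decision depends on which earlier rows were kept.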
