Detect and remove consecutive events (rows in a time series) whose values do not change

Asked: 2017-08-23 13:30:37

Tags: r dataframe

I have to detect faulty sensors in a 10-year time series with a 5-minute time step. The code should be as fast as possible.

For example:

  time <- seq(from = as.POSIXct("2009-01-09 00:00"), to = as.POSIXct("2009-01-09 01:35"), by = "5 min")

  A <- c(4.325775, 5.11995, 6.845995, 5.56784, 1.845995, 1.845995, 1.845995, 1.845995, 9.45555, 9.45558,
         5.93295, 8.28395, 9.645665, 3.79955, 6.34233, 2.545995, 1.745335, 4.33321, 9.125948, 5.645568)

  df <- data.frame(time, A)
  df

   time                A
1  2009-01-09 00:00:00 4.325775
2  2009-01-09 00:05:00 5.119950
3  2009-01-09 00:10:00 6.845995
4  2009-01-09 00:15:00 5.567840
5  2009-01-09 00:20:00 1.845995
6  2009-01-09 00:25:00 1.845995
7  2009-01-09 00:30:00 1.845995
8  2009-01-09 00:35:00 1.845995
9  2009-01-09 00:40:00 9.455550
10 2009-01-09 00:45:00 9.455580
11 2009-01-09 00:50:00 5.932950
12 2009-01-09 00:55:00 8.283950
13 2009-01-09 01:00:00 9.645665
14 2009-01-09 01:05:00 3.799550
15 2009-01-09 01:10:00 6.342330
16 2009-01-09 01:15:00 2.545995
17 2009-01-09 01:20:00 1.745335
18 2009-01-09 01:25:00 4.333210
19 2009-01-09 01:30:00 9.125948
20 2009-01-09 01:35:00 5.645568

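Now I want to remove the rows where the value does not change between successive events; for example, rows 5, 6, 7 and 8 should be removed. If the change is very small (less than 0.001), those rows should be removed as well (rows 9 and 10).

I tried to use rle, but I don't know whether it can identify and remove rows with no change or with a sufficiently small change.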

2 answers:

Answer 0 (score: 2)

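Here is a simple solution in dplyr. Note that filter(change > 0.000) keeps only rows whose value increased relative to the previous row, so it also drops every decrease and the first row (whose change is NA), and it keeps the 00:45 row, whose change is only 0.00003: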

library(dplyr)

df %>% 
  mutate(n_minus_1 = lag(A), 
         change = A - n_minus_1) %>% 
  filter(change > 0.000)


                 time        A n_minus_1   change
1 2009-01-09 00:05:00 5.119950  4.325775 0.794175
2 2009-01-09 00:10:00 6.845995  5.119950 1.726045
3 2009-01-09 00:40:00 9.455550  1.845995 7.609555
4 2009-01-09 00:45:00 9.455580  9.455550 0.000030
5 2009-01-09 00:55:00 8.283950  5.932950 2.351000
6 2009-01-09 01:00:00 9.645665  8.283950 1.361715
7 2009-01-09 01:10:00 6.342330  3.799550 2.542780
8 2009-01-09 01:25:00 4.333210  1.745335 2.587875
9 2009-01-09 01:30:00 9.125948  4.333210 4.792738

You can of course drop the n_minus_1 column:

df %>% 
  mutate(n_minus_1 = lag(A), 
         change = A - n_minus_1) %>% 
  filter(change > 0.000) %>%
  select(-n_minus_1)

                 time        A   change
1 2009-01-09 00:05:00 5.119950 0.794175
2 2009-01-09 00:10:00 6.845995 1.726045
3 2009-01-09 00:40:00 9.455550 7.609555
4 2009-01-09 00:45:00 9.455580 0.000030
5 2009-01-09 00:55:00 8.283950 2.351000
6 2009-01-09 01:00:00 9.645665 1.361715
7 2009-01-09 01:10:00 6.342330 2.542780
8 2009-01-09 01:25:00 4.333210 2.587875
9 2009-01-09 01:30:00 9.125948 4.792738
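
The question asks for rows to be removed whenever the value barely moves in either direction, so a variant that compares the absolute change against a tolerance may be closer to what is wanted. The following is only a sketch under that reading: the names tol, flat_prev, flat_next and cleaned are made up for illustration, and the 0.001 threshold is taken from the question. A row is dropped if it is within tol of either neighbour, which removes every member of a flat stretch, including its first row.

```r
library(dplyr)

# Example data from the question
time <- seq(from = as.POSIXct("2009-01-09 00:00"),
            to   = as.POSIXct("2009-01-09 01:35"), by = "5 min")
A <- c(4.325775, 5.11995, 6.845995, 5.56784, 1.845995, 1.845995, 1.845995,
       1.845995, 9.45555, 9.45558, 5.93295, 8.28395, 9.645665, 3.79955,
       6.34233, 2.545995, 1.745335, 4.33321, 9.125948, 5.645568)
df <- data.frame(time, A)

tol <- 0.001  # assumed threshold, taken from the question

cleaned <- df %>%
  mutate(
    flat_prev = abs(A - lag(A))  < tol,  # NA for the first row
    flat_next = abs(lead(A) - A) < tol   # NA for the last row
  ) %>%
  # treat the NA at each end as "not flat" so the boundary rows can survive
  filter(!coalesce(flat_prev, FALSE), !coalesce(flat_next, FALSE)) %>%
  select(time, A)

cleaned
```

On the example data this keeps original rows 1 to 4 and 11 to 20, so rows 5 to 10 are removed as requested in the question.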

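Answer 1 (score: 1)

You can try removing "duplicates" based on A rounded to three decimal places. Note that duplicated() matches equal rounded values anywhere in the column, not only in adjacent rows.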

# Assumes df is already ordered by the time column
# Round A to three decimal places (0.000)
# Drop every row whose rounded value occurs more than once
df[!(duplicated(round(df$A,3)) | duplicated(round(df$A,3), fromLast = TRUE)), ]

##                   time        A
## 1  2009-01-09 00:00:00 4.325775
## 2  2009-01-09 00:05:00 5.119950
## 3  2009-01-09 00:10:00 6.845995
## 4  2009-01-09 00:15:00 5.567840
## 11 2009-01-09 00:50:00 5.932950
## 12 2009-01-09 00:55:00 8.283950
## 13 2009-01-09 01:00:00 9.645665
## 14 2009-01-09 01:05:00 3.799550
## 15 2009-01-09 01:10:00 6.342330
## 16 2009-01-09 01:15:00 2.545995
## 17 2009-01-09 01:20:00 1.745335
## 18 2009-01-09 01:25:00 4.333210
## 19 2009-01-09 01:30:00 9.125948
## 20 2009-01-09 01:35:00 5.645568
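
Since the question mentions rle, here is a base R sketch of one way it could be used so that only consecutive rows are compared. It is an illustration rather than the method of either answer, and the names tol, small_step, run_id, runs and keep are hypothetical. The idea is to start a new run whenever the step from the previous row is at least tol, then use rle to measure each run and keep only runs of length one.

```r
# Example data from the question
time <- seq(from = as.POSIXct("2009-01-09 00:00"),
            to   = as.POSIXct("2009-01-09 01:35"), by = "5 min")
A <- c(4.325775, 5.11995, 6.845995, 5.56784, 1.845995, 1.845995, 1.845995,
       1.845995, 9.45555, 9.45558, 5.93295, 8.28395, 9.645665, 3.79955,
       6.34233, 2.545995, 1.745335, 4.33321, 9.125948, 5.645568)
df <- data.frame(time, A)

tol <- 0.001  # assumed threshold, taken from the question

# TRUE where the step from the previous row is smaller than tol (length n - 1)
small_step <- abs(diff(df$A)) < tol

# A new run starts at row 1 and after every step that is not small
run_id <- cumsum(c(TRUE, !small_step))

# rle gives the length of each run; expand it back to one value per row
runs <- rle(run_id)
keep <- rep(runs$lengths, runs$lengths) == 1

df[keep, ]
```

One design consequence worth knowing: small steps chain together, so a slow drift in which every individual step is below tol is treated as a single flat run even if the total drift is large.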

Benchmarking the solutions:

microbenchmark::microbenchmark(
    Dplyr  = df %>% 
                  mutate(n_minus_1 = lag(A), 
                  change = A - n_minus_1) %>% 
                  filter(change > 0.000) %>%
                  select(-n_minus_1),
    Base_R = df[!(duplicated(round(df$A,3)) | duplicated(round(df$A,3), fromLast = TRUE)), ])


## Unit: microseconds
##    expr       min        lq       mean     median         uq       max neval
##   Dplyr 16400.436 16775.964 17334.0477 17006.7475 17501.9980 20525.279   100
##  Base_R   203.259   207.494   227.6161   224.8175   241.5635   396.509   100
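
The timings above come from the 20-row example, where fixed call overhead is likely to dominate. To get a feel for the 10-year case described in the question, one could time a candidate on synthetic data of similar size. The sketch below is an assumption-laden illustration: the data are random numbers rather than real sensor readings, the names big, timing and cleaned are hypothetical, and the code being timed is a vectorised diff/rle filter rather than either answer's code. No timings are shown because they depend on the machine.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Roughly 10 years of 5-minute steps (288 per day, leap days ignored)
n <- 10 * 365 * 288

big <- data.frame(
  time = seq(from = as.POSIXct("2009-01-09 00:00", tz = "UTC"),
             by = "5 min", length.out = n),
  A    = round(runif(n, 0, 10), 6)  # synthetic readings, not real data
)

# Inject one stuck-sensor episode so there is something to remove
big$A[1001:1100] <- big$A[1000]

timing <- system.time({
  small_step <- abs(diff(big$A)) < 0.001       # consecutive steps below the threshold
  run_id     <- cumsum(c(TRUE, !small_step))   # run label per row
  runs       <- rle(run_id)
  keep       <- rep(runs$lengths, runs$lengths) == 1
  cleaned    <- big[keep, ]
})

print(timing)
```

The same data frame can be passed to microbenchmark::microbenchmark with a small times value to compare several candidates on equal footing.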