I need to detect faulty sensors in a 10-year time series with a 5-minute time step, so the code should be as fast as possible.
For example:
time<-seq(from=as.POSIXct("2009-01-09 00:00"),to=as.POSIXct("2009-01-09 01:35"), by= "5 min")
A<-c(4.325775,5.11995,6.845995,5.56784, 1.845995,1.845995,1.845995,1.845995,9.45555,9.45558,
5.93295,8.28395,9.645665,3.79955,6.34233,2.545995,1.745335,4.33321,9.125948,5.645568)
df<-data.frame(time,A)
df
time A
1 2009-01-09 00:00:00 4.325775
2 2009-01-09 00:05:00 5.119950
3 2009-01-09 00:10:00 6.845995
4 2009-01-09 00:15:00 5.567840
5 2009-01-09 00:20:00 1.845995
6 2009-01-09 00:25:00 1.845995
7 2009-01-09 00:30:00 1.845995
8 2009-01-09 00:35:00 1.845995
9 2009-01-09 00:40:00 9.455550
10 2009-01-09 00:45:00 9.455580
11 2009-01-09 00:50:00 5.932950
12 2009-01-09 00:55:00 8.283950
13 2009-01-09 01:00:00 9.645665
14 2009-01-09 01:05:00 3.799550
15 2009-01-09 01:10:00 6.342330
16 2009-01-09 01:15:00 2.545995
17 2009-01-09 01:20:00 1.745335
18 2009-01-09 01:25:00 4.333210
19 2009-01-09 01:30:00 9.125948
20 2009-01-09 01:35:00 5.645568
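Now I want to remove the rows where the value does not change between successive readings; for example, rows 5, 6, 7 and 8 should be removed. If the change is very small (less than 0.001), the rows should be removed as well (rows 9 and 10).
I tried to use rle, but I don't know whether it can identify and remove rows with no change or with a sufficiently small change.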
Answer 0 (score: 2)
Here is a simple solution in dplyr:
library(dplyr)
df %>%
mutate(n_minus_1 = lag(A),
change = A - n_minus_1) %>%
filter(change > 0.000)
time A n_minus_1 change
1 2009-01-09 00:05:00 5.119950 4.325775 0.794175
2 2009-01-09 00:10:00 6.845995 5.119950 1.726045
3 2009-01-09 00:40:00 9.455550 1.845995 7.609555
4 2009-01-09 00:45:00 9.455580 9.455550 0.000030
5 2009-01-09 00:55:00 8.283950 5.932950 2.351000
6 2009-01-09 01:00:00 9.645665 8.283950 1.361715
7 2009-01-09 01:10:00 6.342330 3.799550 2.542780
8 2009-01-09 01:25:00 4.333210 1.745335 2.587875
9 2009-01-09 01:30:00 9.125948 4.333210 4.792738
You can of course drop the n_minus_1 column:
df %>%
mutate(n_minus_1 = lag(A),
change = A - n_minus_1) %>%
filter(change > 0.000) %>%
select(-n_minus_1)
time A change
1 2009-01-09 00:05:00 5.119950 0.794175
2 2009-01-09 00:10:00 6.845995 1.726045
3 2009-01-09 00:40:00 9.455550 7.609555
4 2009-01-09 00:45:00 9.455580 0.000030
5 2009-01-09 00:55:00 8.283950 2.351000
6 2009-01-09 01:00:00 9.645665 1.361715
7 2009-01-09 01:10:00 6.342330 2.542780
8 2009-01-09 01:25:00 4.333210 2.587875
9 2009-01-09 01:30:00 9.125948 4.792738
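The filter above compares each reading only with the previous one and keeps only positive changes, so decreases are dropped and the 0.00003 step into 00:45 survives. Below is a minimal base R sketch of a two-sided check, assuming the 0.001 threshold from the question. The names `tol`, `small`, `stuck` and `clean` are illustrative and not part of the original answer.

```r
# Example readings from the question
time <- seq(from = as.POSIXct("2009-01-09 00:00"),
            to   = as.POSIXct("2009-01-09 01:35"), by = "5 min")
A <- c(4.325775, 5.11995, 6.845995, 5.56784, 1.845995, 1.845995, 1.845995,
       1.845995, 9.45555, 9.45558, 5.93295, 8.28395, 9.645665, 3.79955,
       6.34233, 2.545995, 1.745335, 4.33321, 9.125948, 5.645568)
df <- data.frame(time, A)

tol <- 0.001  # assumed threshold, taken from the question

# small[i] is TRUE when the step from row i to row i + 1 is below tol
small <- abs(diff(df$A)) < tol

# A row counts as stuck if the step into it or the step out of it is small
stuck <- c(FALSE, small) | c(small, FALSE)

clean <- df[!stuck, ]
```

On the example data this leaves rows 1 to 4 and 11 to 20, which is the result the question asks for. Everything is vectorised, so it should scale to long series, though that is not measured here.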
Answer 1 (score: 1)
You could try removing "duplicates" based on A rounded to three decimal places:
# Assumes df is already ordered by the time column
# Round A to three decimal places (0.000)
# Drop every row whose rounded value occurs more than once
df[!(duplicated(round(df$A,3)) | duplicated(round(df$A,3), fromLast = TRUE)), ]
## time A
## 1 2009-01-09 00:00:00 4.325775
## 2 2009-01-09 00:05:00 5.119950
## 3 2009-01-09 00:10:00 6.845995
## 4 2009-01-09 00:15:00 5.567840
## 11 2009-01-09 00:50:00 5.932950
## 12 2009-01-09 00:55:00 8.283950
## 13 2009-01-09 01:00:00 9.645665
## 14 2009-01-09 01:05:00 3.799550
## 15 2009-01-09 01:10:00 6.342330
## 16 2009-01-09 01:15:00 2.545995
## 17 2009-01-09 01:20:00 1.745335
## 18 2009-01-09 01:25:00 4.333210
## 19 2009-01-09 01:30:00 9.125948
## 20 2009-01-09 01:35:00 5.645568
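Note that duplicated() compares values across the whole series, so over ten years two unrelated readings that happen to round to the same value would both be removed. Below is a sketch using rle, as the question suggests, that only looks at consecutive readings. `tol` and `min_run` are hypothetical parameters introduced for illustration; `min_run` is the shortest flat stretch treated as a fault.

```r
# Example readings from the question
time <- seq(from = as.POSIXct("2009-01-09 00:00"),
            to   = as.POSIXct("2009-01-09 01:35"), by = "5 min")
A <- c(4.325775, 5.11995, 6.845995, 5.56784, 1.845995, 1.845995, 1.845995,
       1.845995, 9.45555, 9.45558, 5.93295, 8.28395, 9.645665, 3.79955,
       6.34233, 2.545995, 1.745335, 4.33321, 9.125948, 5.645568)
df <- data.frame(time, A)

tol     <- 0.001  # assumed threshold, taken from the question
min_run <- 2      # hypothetical: shortest flat stretch treated as a fault

# Start a new run whenever the reading moves by at least tol
run_id <- cumsum(c(TRUE, abs(diff(df$A)) >= tol))

# rle() gives the length of each run; expand it back to one value per row
runs    <- rle(run_id)
run_len <- rep(runs$lengths, times = runs$lengths)

# Keep only rows that belong to runs shorter than min_run
clean <- df[run_len < min_run, ]
```

Raising `min_run` would tolerate short plateaus and flag only longer ones. Because runs are chained step by step, a reading that drifts slowly with every step below `tol` is also treated as one run.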
Benchmarking the solutions:
microbenchmark::microbenchmark(
Dplyr = df %>%
mutate(n_minus_1 = lag(A),
change = A - n_minus_1) %>%
filter(change > 0.000) %>%
select(-n_minus_1),
Base_R = df[!(duplicated(round(df$A,3)) | duplicated(round(df$A,3), fromLast = TRUE)), ])
## Unit: microseconds
## expr min lq mean median uq max neval
## Dplyr 16400.436 16775.964 17334.0477 17006.7475 17501.9980 20525.279 100
## Base_R 203.259 207.494 227.6161 224.8175 241.5635 396.509 100
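Timings on a 20-row data frame may mostly reflect fixed per-call overhead, whereas ten years at 5-minute steps is roughly a million rows. Below is a rough sketch for timing at that scale. The data are simulated, the flat stretch is injected artificially, and `drop_stuck` is a hypothetical helper that flags rows whose step to either neighbour is below the threshold. No timings are claimed here, since they depend on the machine.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Number of 5-minute steps in ten 365-day years
n <- 10 * 365 * 24 * 12

# Simulated readings (made up for illustration) with one injected flat stretch
A_sim <- round(runif(n, min = 0, max = 10), 6)
A_sim[1001:1050] <- A_sim[1001]

# Hypothetical helper: TRUE for rows to keep, FALSE for rows whose step
# to the previous or the next reading is smaller than tol
drop_stuck <- function(x, tol = 0.001) {
  small <- abs(diff(x)) < tol
  !(c(FALSE, small) | c(small, FALSE))
}

timing <- system.time(keep <- drop_stuck(A_sim))
print(timing)

# Number of rows flagged as stuck (includes chance near-ties in random data)
sum(!keep)
```

The same pattern can be wrapped around any of the approaches in this thread to compare them on data of realistic size.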