I have a simple data frame.
a <- c("06/12/2012 06:00","06/12/2012 06:05","06/12/2012 06:10","06/12/2012 06:15","06/12/2012 06:20","06/12/2012 06:25",
"06/12/2012 06:30","06/12/2012 06:35","06/12/2012 06:40","06/12/2012 06:45","06/12/2012 06:50","06/12/2012 06:55",
"06/12/2012 07:00","06/12/2012 07:05","06/12/2012 07:10","06/12/2012 07:15","06/12/2012 07:20","06/12/2012 07:25",
"06/12/2012 07:30","06/12/2012 07:35","06/12/2012 07:40","06/12/2012 07:45","06/12/2012 07:50","06/12/2012 07:55",
"06/12/2012 08:00")
a <- strptime(a, "%d/%m/%Y %H:%M")
b <-c("1","0","0","0","2","0","0","0","3","0","0","0","0","0","1","2","5","6","0","0","0","0","6","10","2")
df1 <- data.frame(a,b)
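I would like to use R to remove the parts of the data frame where there is not enough valid data. Readings are recorded every 5 minutes. If column 'b' contains nothing but zeros for a continuous stretch of 20 minutes or longer, those rows can be dropped from my final data frame.
If anyone has any ideas that could help me, I would be very grateful.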
Answer 0 (score: 3)
Another one, still using rle:
is.zero <- df1$b == 0
is.zero.rle <- rle(is.zero)
df1[rep(is.zero.rle$lengths, is.zero.rle$lengths) * is.zero < 4, ]
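It might help if I show the intermediate result. Each row that holds a zero gets the length of the zero run it belongs to, and every other row gets 0: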
rep(is.zero.rle$lengths, is.zero.rle$lengths) * is.zero
# [1] 0 3 3 3 0 3 3 3 0 5 5 5 5 5 0 0 0 0 4 4 4 4 0 0 0
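If you need this in more than one place, the same idea can be wrapped in a small function. This is only a sketch: the name drop_zero_runs and its min_run argument are made up for illustration, and it assumes the readings contain no missing values and are evenly spaced, so that a row count stands in for a duration.

```r
# Hypothetical helper (not part of the answer above): drop rows that belong to
# a run of at least `min_run` consecutive zeros in `values`.
drop_zero_runs <- function(df, values, min_run = 4) {
  is_zero <- values == 0
  runs <- rle(is_zero)
  # Length of the run each row belongs to, set to 0 for non-zero rows
  run_len <- rep(runs$lengths, runs$lengths) * is_zero
  df[run_len < min_run, , drop = FALSE]
}

# Made-up data: one run of 4 zeros (dropped) and one run of 2 zeros (kept)
toy <- data.frame(id = 1:9, b = c(1, 0, 0, 0, 0, 3, 0, 0, 2))
kept <- drop_zero_runs(toy, toy$b)
kept$id
```

On the toy data this keeps ids 1, 6, 7, 8 and 9. With the question's data, drop_zero_runs(df1, df1$b) should select the same rows as the one-liner above.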
Answer 1 (score: 2)
One solution using rle (as Ben mentioned in the comments):
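# get rle (b is stored as text, so convert it to numeric first)
r <- rle(as.numeric(as.character(df1$b)))
# check for condition. NOTE: here I assume all are 5 minute intervals!!
# So, if rle length >= 4, then it's a >= 20 minute interval
p <- which(r$values == 0 & r$lengths >= 4)
w <- cumsum(r$lengths)
# prepend 0 so that a zero run at the very start of the data also works
s <- c(0, w)
o <- unlist(lapply(p, function(x) (s[x] + 1):w[x]))
# guard the empty case: df1[-NULL, ] would return zero rows
if (length(o) > 0) df1[-o, ] else df1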
#                      a  b
# 1  2012-12-06 06:00:00  1
# 2  2012-12-06 06:05:00  0
# 3  2012-12-06 06:10:00  0
# 4  2012-12-06 06:15:00  0
# 5  2012-12-06 06:20:00  2
# 6  2012-12-06 06:25:00  0
# 7  2012-12-06 06:30:00  0
# 8  2012-12-06 06:35:00  0
# 9  2012-12-06 06:40:00  3
# 15 2012-12-06 07:10:00  1
# 16 2012-12-06 07:15:00  2
# 17 2012-12-06 07:20:00  5
# 18 2012-12-06 07:25:00  6
# 23 2012-12-06 07:50:00  6
# 24 2012-12-06 07:55:00 10
# 25 2012-12-06 08:00:00  2
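Both answers count rows and rely on the readings being exactly 5 minutes apart. If that might not hold, here is a rough sketch of how you could check the spacing and measure each zero run from the timestamps instead. The function name drop_zero_runs_by_time, its arguments and the toy data are all made up for illustration. It also assumes that a run covers the time from its first to its last reading plus one sampling step, which is how 4 readings come to count as 20 minutes in the answers above.

```r
# Made-up readings: 5-minute steps with the 06:15 row missing
toy <- data.frame(
  id = 1:5,
  a = as.POSIXct(c("2012-12-06 06:00", "2012-12-06 06:05", "2012-12-06 06:10",
                   "2012-12-06 06:20", "2012-12-06 06:25"),
                 format = "%Y-%m-%d %H:%M", tz = "UTC"),
  b = c(2, 0, 0, 0, 1)
)

# Gap between consecutive rows in minutes; not all equal to 5 here
gaps <- as.numeric(diff(toy$a), units = "mins")
regular <- all(gaps == 5)

# Hypothetical helper: drop zero runs that cover at least `min_minutes`,
# where `times` is POSIXct and `step` is the assumed sampling interval
drop_zero_runs_by_time <- function(df, times, values,
                                   min_minutes = 20, step = 5) {
  is_zero <- values == 0
  runs <- rle(is_zero)
  # One id per run, repeated for every row in that run
  run_id <- rep(seq_along(runs$lengths), runs$lengths)
  mins <- as.numeric(times) / 60
  # Per-row span of the run it belongs to: last minus first reading plus one step
  span <- ave(mins, run_id, FUN = function(m) max(m) - min(m) + step)
  df[!(is_zero & span >= min_minutes), , drop = FALSE]
}

kept <- drop_zero_runs_by_time(toy, toy$a, toy$b)
kept$id
```

In the toy data the zero run has only 3 rows, so the row-count versions would keep it, but it stretches from 06:05 to 06:20 and is dropped here, leaving ids 1 and 5. Whether a missing reading inside a run of zeros should count towards the 20 minutes is a judgment call that depends on your data.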