I have a CSV file of values sampled once per second, which looks like this:
"x","timestamp","value"
"1",2016-01-01 00:00:00,124
"2",2016-01-01 00:00:01,121
"3",2016-01-01 00:00:02,NA
"4",2016-01-01 00:00:03,NA
"5",2016-01-01 00:00:04,NA
"6",2016-01-01 00:00:05,123
"7",2016-01-01 00:00:06,122
"8",2016-01-01 00:00:07,124
"9",2016-01-01 00:00:08,NA
"10",2016-01-01 00:00:09,124
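So some data is missing and marked as NA. Now I want to make a histogram of the lengths of the missing-data blocks. For the example above, it would count how many missing-data blocks have a length of 1 sec (1), 2 sec (0), 3 sec (1), and so on.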
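To make the expected counts concrete, here is a minimal sketch in base R that reproduces them for the ten sample rows. The names `value`, `gap_lengths` and `gap_counts` are only illustrative, and the values are typed in by hand rather than read from the CSV.

```r
# The ten values from the sample above, in row order
value <- c(124, 121, NA, NA, NA, 123, 122, 124, NA, 124)

# rle() collapses consecutive equal entries into (value, length) pairs
r <- rle(is.na(value))

# Keep only the lengths of the runs where is.na() was TRUE
gap_lengths <- r$lengths[r$values]

# Count gaps of length 1, 2 and 3; factor() keeps the empty length-2 slot
gap_counts <- table(factor(gap_lengths, levels = 1:3))
print(gap_counts)
```

Because the data has one row per second, the number of rows in a run equals the gap length in seconds.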
In my real-life data set the bins/intervals will be different. I have these eight categories in mind:
= 1 sec
2 to 5 sec
6 to 10 sec
11 to 30 sec
31 to 300 sec
301 to 3600 sec
3601 to 86400 sec
> 86400 sec
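One way these categories might be expressed in R is with `cut()`. This is only a sketch: the gap lengths below are made-up illustrative numbers, the label strings are my own, and I am assuming the seventh category starts at 3601 so that the bins do not overlap.

```r
# Hypothetical gap lengths in seconds (illustrative values, not real data)
gap_lengths <- c(1, 3, 1, 7, 45, 4000, 90000)

# Bin edges; right = TRUE makes every bin (lower, upper]
breaks <- c(0, 1, 5, 10, 30, 300, 3600, 86400, Inf)
labels <- c("1", "2-5", "6-10", "11-30", "31-300",
            "301-3600", "3601-86400", ">86400")

# Assign each gap to its category and count the gaps per category
binned <- cut(gap_lengths, breaks = breaks, labels = labels, right = TRUE)
bin_counts <- table(binned)
print(bin_counts)
```

Empty categories still show up with a count of 0, because `cut()` returns a factor that carries all eight levels.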
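So my idea is to have R code walk through all rows of the CSV and, whenever it detects an NA value, count rows until it finds an actual value again. The eight categories could be integer variables, each of which is incremented by 1 whenever an NA block of the matching length is detected.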
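Spelled out literally, that idea could look roughly like the sketch below. `count_na_blocks` and its `upper` argument are hypothetical names, and the default upper edges assume the eight categories listed above.

```r
# Hypothetical helper: walk the vector once and tally NA block lengths per category
count_na_blocks <- function(value,
                            upper = c(1, 5, 10, 30, 300, 3600, 86400, Inf)) {
  counts <- integer(length(upper))  # one counter per category
  current <- 0L                     # length of the NA block being scanned
  for (v in value) {
    if (is.na(v)) {
      current <- current + 1L
    } else if (current > 0L) {
      # Block just ended: bump the first category whose upper edge fits
      idx <- which(current <= upper)[1]
      counts[idx] <- counts[idx] + 1L
      current <- 0L
    }
  }
  # A block that runs to the end of the data is never closed by a real value
  if (current > 0L) {
    idx <- which(current <= upper)[1]
    counts[idx] <- counts[idx] + 1L
  }
  counts
}

counts <- count_na_blocks(c(124, 121, NA, NA, NA, 123, 122, 124, NA, 124))
print(counts)
```

An explicit loop like this is easy to follow but tends to be slow in R on large vectors, so a vectorised approach is probably preferable for a long recording.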
As a complete R noob, I have no idea how to do this. Any help would be much appreciated :)
Answer 0 (score: 0)
I am sure there must be a time-series solution, but to get you started (using set.seed so the random values are reproducible):
set.seed(42)
# Create some sample data: 100 one-second timestamps with random NAs
# (a fixed start time keeps the timestamps reproducible too)
start_time <- as.POSIXct("2016-01-01 00:00:00", tz = "UTC")
df <- data.frame(x = 1:100,
                 timestamp = seq(from = start_time, by = "sec", length.out = 100),
                 value = sample(c(NA, 1:3), 100, replace = TRUE))
# Runs of identical data (TRUE = missing, FALSE = present)
runs <- rle(is.na(df$value))
# Indices of the runs that are missing
missing <- which(runs$values)
# The end position of every run in the sequence
positions <- cumsum(runs$lengths)
# The start and end times of each missing run
start <- df$timestamp[positions[missing] - runs$lengths[missing] + 1]
end <- df$timestamp[positions[missing]]
# Time difference in seconds; units must be passed by name, because the
# third positional argument of difftime() is tz. A one-row gap gives 0.
delta <- difftime(end, start, units = "secs")
# Combine in a usable data.frame
output <- data.frame(StartRow = positions[missing] - runs$lengths[missing] + 1,
                     EndRow = positions[missing],
                     StartTime = start,
                     EndTime = end,
                     Duration = delta)
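For what it is worth, here is the same run-length idea applied to the ten-row sample from the question, as a rough sketch. The names `csv_text`, `gaps` and `Seconds` are my own, the CSV is inlined instead of read from a file, and the `+ 1` assumes strictly one row per second.

```r
# The sample from the question, inlined so the snippet is self-contained
csv_text <- '"x","timestamp","value"
"1",2016-01-01 00:00:00,124
"2",2016-01-01 00:00:01,121
"3",2016-01-01 00:00:02,NA
"4",2016-01-01 00:00:03,NA
"5",2016-01-01 00:00:04,NA
"6",2016-01-01 00:00:05,123
"7",2016-01-01 00:00:06,122
"8",2016-01-01 00:00:07,124
"9",2016-01-01 00:00:08,NA
"10",2016-01-01 00:00:09,124'

df <- read.csv(text = csv_text, stringsAsFactors = FALSE)
df$timestamp <- as.POSIXct(df$timestamp, tz = "UTC")

# Locate the first and last row of every run of missing values
runs <- rle(is.na(df$value))
missing <- which(runs$values)
end_row <- cumsum(runs$lengths)[missing]
start_row <- end_row - runs$lengths[missing] + 1

gaps <- data.frame(
  StartRow = start_row,
  EndRow   = end_row,
  # end - start is the span between the first and last missing timestamp;
  # adding 1 counts the missing one-second samples themselves
  Seconds  = as.numeric(difftime(df$timestamp[end_row],
                                 df$timestamp[start_row],
                                 units = "secs")) + 1
)
print(gaps)
```

The `Seconds` column could then be fed straight into whatever binning you settle on.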
Answer 1 (score: 0)
Maybe this is useful:
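# df1 is the data frame read from the CSV
temp <- rle(is.na(df1$value))
runs <- temp$lengths[temp$values]
# Count the gaps per category; every interval is (lower, upper]
bins <- cut(runs, breaks = c(0, 1, 5, 10, 30, 300, 3600, 86400, Inf), right = TRUE)
table(bins)
# Plot the counts; hist() with breaks from 1 to 86400 would put 1 in the 2-5 bin and fail on longer gaps
barplot(table(bins))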