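R: find periods with non-missing values from a list of time series

Asked: 2014-10-14 15:20:05

Tags: r time-series missing-data

I have many lists of time series, and each time series has some missing values. Here is a short example: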

  x <- list(structure(c(NA, NA, 30, 1260, 504, 24, 132, 60, 766.8, 643.68, 
    54.96, 0, 9.48, 186.36, NA, NA, NA, NA, 723.24, 426.36, 198.96, 
    528.72, 29.04, 132, 60, 348, 5.04, 12, 144, 0), index = structure(c(189385200, 
    189471600, 189558000, 189644400, 189730800, 189817200, 189903600, 
    189990000, 190076400, 190162800, 190249200, 190335600, 190422000, 
    190508400, 190594800, 190681200, 190767600, 190854000, 190940400, 
    191026800, 191113200, 191199600, 191286000, 191372400, 191458800, 
    191545200, 191631600, 191718000, 191804400, 191890800), class = c("POSIXct", 
    "POSIXt")), class = "zoo"), structure(c(NA, NA, 144.96, 33.96, 
    10.08, 20.64, 12, NA, NA, 13.1904, 21.8784, 19.836, 30.8208, 
    96.3312, 57.3288, 30.0672, 25.9872, NA, NA, NA, NA, 56.3472, 
    79.4064, 35.64, 25.92, 44.88, 4.872, 78), index = structure(c(189385200, 
    189471600, 189558000, 189644400, 189730800, 189817200, 189903600, 
    189990000, 190076400, 190162800, 190249200, 190335600, 190422000, 
    190508400, 190594800, 190681200, 190767600, 190854000, 190940400, 
    191026800, 191113200, 191199600, 191286000, 191372400, 191458800, 
    191545200, 191631600, 191718000), class = c("POSIXct", "POSIXt"
    )), class = "zoo"), structure(c(25.8876260869565, 33.931, 12.50435, 
    19.721225, 17.5955, 10.296775, 6.862425, 5.321225, 10.0137, 14.7752, 
    11.35255, 7.0339, 5.2703, 4.672575, 3.777625, 3.26115, 2.97095, 
    NA, NA, NA, NA, NA, NA, 5.469975, 4.29925), index = structure(c(189385200, 
    189471600, 189558000, 189644400, 189730800, 189817200, 189903600, 
    189990000, 190076400, 190162800, 190249200, 190335600, 190422000, 
    190508400, 190594800, 190681200, 190767600, 190854000, 190940400, 
    191026800, 191113200, 191199600, 191286000, 191372400, 191458800
    ), class = c("POSIXct", "POSIXt")), class = "zoo"))

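I need to find the start and end of each period in which none of the time series has a missing value. For the example above, I would expect something like: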

START                  END
1976-01-03 23:00:00    1976-01-07 23:00:00
1976-01-10 23:00:00    1976-01-14 23:00:00
1976-01-24 23:00:00    1976-01-25 23:00:00

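I could write a loop that checks every time step for a non-NA value and, if the previous (next) value is NA, writes the timestamp to the START (END) column of a data frame.

Is there an existing function that already does this (and is possibly faster than a plain loop)?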

1 Answer:

Answer 0: (score: 0)

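You can take a vectorized approach based on the following logic:

If a value is NA, it cannot be a start point or an end point.

If a value is not NA, it is a start point if and only if:
  • the value to its left is NA, or
  • it is the first value in the series.

If a value is not NA, it is an end point if and only if:
  • the value to its right is NA, or
  • it is the last value in the series.

So the logic looks like: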

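start_end_points <- function(x) {
  # Plain logical vector: TRUE where the series value is missing.
  x_is_na <- is.na(as.vector(x))
  # Negative indices drop the last (first) element and also work
  # for length-1 input.
  prev_is_na_or_first <- c(TRUE, x_is_na[-length(x_is_na)])
  next_is_na_or_last <- c(x_is_na[-1], TRUE)
  x_is_start_point <- !x_is_na & prev_is_na_or_first
  x_is_end_point <- !x_is_na & next_is_na_or_last
  # Each run of non-NA values has exactly one start and one end,
  # so both subsets of the index have the same length.
  idx <- attr(x, "index")
  data.frame(start_point = idx[x_is_start_point],
             end_point = idx[x_is_end_point])
}
lapply(x, start_end_points)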

This returns:

[[1]]
          start_point           end_point
1 1976-01-03 18:00:00 1976-01-14 18:00:00
2 1976-01-19 18:00:00 1976-01-30 18:00:00

[[2]]
          start_point           end_point
1 1976-01-03 18:00:00 1976-01-07 18:00:00
2 1976-01-10 18:00:00 1976-01-17 18:00:00
3 1976-01-22 18:00:00 1976-01-28 18:00:00

[[3]]
          start_point           end_point
1 1976-01-01 18:00:00 1976-01-17 18:00:00
2 1976-01-24 18:00:00 1976-01-25 18:00:00

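(The times are displayed differently from those in the question, 18:00 instead of 23:00, presumably because of my system's time zone setting.)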
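
The output above lists the gap-free periods of each series separately, whereas the question asks for the periods in which no series has a missing value. Below is a minimal sketch of one way to get there in base R: keep only the timestamps shared by all series, combine the NA masks with a logical OR, and apply the same start/end logic to the combined mask. The names `make_series` and `common_complete_periods` are hypothetical helpers introduced for illustration. The toy data reproduces only the timestamps and NA positions of the question's example (the non-missing values are placeholders), and using `tz = "UTC"` is an assumption chosen to match the 23:00 timestamps shown in the question.

```r
# Toy data with the same timestamps and NA positions as the question's
# example; non-missing values are placeholders because only NA-ness matters.
make_series <- function(n, na_pos) {
  values <- rep(1, n)
  values[na_pos] <- NA
  index <- as.POSIXct(189385200 + 86400 * (seq_len(n) - 1),
                      origin = "1970-01-01", tz = "UTC")
  structure(values, index = index, class = "zoo")
}
x <- list(make_series(30, c(1:2, 15:18)),
          make_series(28, c(1:2, 8:9, 18:21)),
          make_series(25, 18:23))

# Hypothetical helper: periods during which no series in the list is NA.
common_complete_periods <- function(series_list) {
  # Timestamps (seconds since the epoch) present in every series.
  common <- sort(Reduce(intersect, lapply(series_list, function(s) {
    as.numeric(attr(s, "index"))
  })))
  # TRUE where at least one series is NA at a shared timestamp.
  any_na <- Reduce(`|`, lapply(series_list, function(s) {
    pos <- match(common, as.numeric(attr(s, "index")))
    is.na(as.vector(s)[pos])
  }))
  ok <- !any_na
  # Same start/end rule as in the answer, applied to the combined mask.
  is_start <- ok & c(TRUE, any_na[-length(any_na)])
  is_end <- ok & c(any_na[-1], TRUE)
  to_time <- function(secs) {
    as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")
  }
  data.frame(START = to_time(common[is_start]),
             END = to_time(common[is_end]))
}

periods <- common_complete_periods(x)
print(periods)
```

With this NA pattern the result should contain three rows, matching the periods the question expects: 1976-01-03 to 1976-01-07, 1976-01-10 to 1976-01-14, and 1976-01-24 to 1976-01-25, all at 23:00 UTC. Timestamps that are missing from any one series are dropped before the check, so the longer series are effectively truncated to the shortest one here.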