Question

I want to remove values from a time series when they are surrounded by blocks of NA of a specified minimum length.

Some toy data:

x = seq(0,10,length.out = 100)
y = sin(x) + rnorm(length(x), mean=0, sd=0.1)
y[20:21] = rep(NA, 2)
y[50:54] = rep(NA, 5)
y[55:59] = seq(-0.1, -0.8, length.out = 5)
y[60:64] = rep(NA, 5)
y[90:91] = rep(NA, 2)

df <- data.frame(x, y)

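I want to remove sequences of y values that are shorter than 10 and that are both preceded and followed by 5 or more NA values.

In my toy data, the y values at indices 55-59 are (a) a run of fewer than 10 consecutive values and (b) bordered on both sides by 5 NAs. This block of values should therefore be removed.

The other values belong to longer blocks of values and/or are surrounded by short runs of NA (< 5), and should be kept.

Plot with the values to be removed shown in red: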

library(ggplot2)
ggplot(data = df, aes(x, y)) +
  geom_line() +
  geom_line(data = df[55:59, ], color = "red")


Answer 1

First, we define the two thresholds you specified. (I set the second one to 4, so that we can consistently work with "<" and ">" instead of the error-prone mix of "<" and ">=".)

threshold.data <- 10
threshold.NA <- 4

Now, the key is to use run-length encoding on is.na(y). Have a look at ?rle.

foo <- rle(is.na(y))
foo
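To make the rle() output easier to read, here is a minimal sketch that rebuilds only the NA pattern of the toy data (without the random values) and tabulates each run with its start and end index. The names na_flag and run_table are made up for this illustration.

```r
# Rebuild the NA pattern of the toy data (TRUE where y is NA)
na_flag <- rep(FALSE, 100)
na_flag[c(20:21, 50:54, 60:64, 90:91)] <- TRUE

runs <- rle(na_flag)

# One row per run: whether it is an NA run, its length, and where it sits in y
run_table <- data.frame(
  is_na  = runs$values,
  length = runs$lengths,
  end    = cumsum(runs$lengths)
)
run_table$start <- run_table$end - run_table$length + 1
run_table
```

The fifth row is the non-NA run covering indices 55 to 59, sitting between two NA runs of length 5.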

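First, we extract the possible "candidate runs of NAs" by checking where the original data were NA (so foo$values will be TRUE) and where the run length exceeds the minimum we specified for NA runs: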

candidate.runs.NA <- which(foo$values & foo$lengths>threshold.NA)

We only want to continue if we have at least two runs of NAs that exceed the threshold:

if ( diff(range(candidate.runs.NA)) >= 2 ) {

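Our goal is to find the indices of the non-NA data we want to remove. To do so, we look for "candidate runs" of (non-NA) data. In a first step, these comprise all runs between the first and the last NA run identified above: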

    candidate.runs.data <- seq(candidate.runs.NA[1]+1,tail(candidate.runs.NA,1)-1)

We refine this using two criteria. On the one hand, we only want runs of non-NAs; on the other hand, these runs should be shorter than the threshold:

    candidate.runs.data <- candidate.runs.data[!foo$values[candidate.runs.data] &
      foo$lengths[candidate.runs.data]<threshold.data]

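In your example, candidate.runs.data now has a single entry, 5. This means we need to remove all data in the fifth run of the is.na series. To do that, we need to recover the actual indices: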

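    indices.to.remove <- unlist(lapply(candidate.runs.data,function(kk)
      seq(sum(foo$lengths[1:(kk-1)])+1,sum(foo$lengths[1:kk]))))

This is a little more complicated because I wrapped it in an lapply() call, in case there are multiple candidate.runs.data to remove; unlist() then gives a plain index vector even when the runs differ in length. Finally, we remove these data: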

    y[indices.to.remove] <- NA
}
plot(x,y,"l")


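Now, this seems to do what you want for your specific example. You may want to think about what should happen in boundary cases. For instance, suppose your series starts with non-NAs. What should happen if you have not two runs of five or more NAs, but three or five? With or without shorter NA runs between the "long" runs? This script treats any run of at most nine non-NAs between the first and the last "long" NA run as fair game.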
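To see one of those boundary cases in action, here is a rough sketch that wraps the steps above in a function and applies it to a short series with a one-element NA run between two long NA runs. The function name remove_between_long_na and the vector z are hypothetical, and the guard uses length() rather than diff(range()), which I assume is equivalent here because two distinct NA runs are always at least two positions apart in the rle output.

```r
# Hypothetical wrapper around the steps of this answer (not a tested library function)
remove_between_long_na <- function(y, threshold.data = 10, threshold.NA = 4) {
  foo <- rle(is.na(y))
  candidate.runs.NA <- which(foo$values & foo$lengths > threshold.NA)
  # Need at least two long NA runs to have anything "between" them
  if (length(candidate.runs.NA) < 2) return(y)

  candidate.runs.data <- seq(candidate.runs.NA[1] + 1, tail(candidate.runs.NA, 1) - 1)
  candidate.runs.data <- candidate.runs.data[!foo$values[candidate.runs.data] &
    foo$lengths[candidate.runs.data] < threshold.data]

  # Start and end index of every run in the original vector
  run.end <- cumsum(foo$lengths)
  run.start <- run.end - foo$lengths + 1
  indices.to.remove <- unlist(Map(seq, run.start[candidate.runs.data],
                                  run.end[candidate.runs.data]))
  y[indices.to.remove] <- NA
  y
}

# Two long NA runs with a short NA run in between
z <- c(1, rep(NA, 5), 2, 3, NA, 4, 5, 6, rep(NA, 5), 7)
out <- remove_between_long_na(z)
which(!is.na(out))  # 1 18
```

Both short data runs (indices 7-8 and 10-12) are removed, although each of them touches a long NA run on one side only. Whether that is what you want depends on how you read the requirement.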

Answer 2

You can treat your time series as a string and take advantage of regular expressions here. With the help of the str_locate_all function from the stringr package, the problem can be solved easily.

st <- paste0(as.integer(is.na(df$y)), collapse = '')
# [1] "0000000000000000000110000000000000000000000000000111110000011111000000000000000000000000011000000000"

require("stringr")
str_locate_all(st, "1{5,}0{,10}1{5,}")
# pattern of at least 5 ones, then not more than 10 zeros, then again not less than 5 ones
# output will be:
# [[1]]
#      start end
# [1,]    50  64
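If you would rather stay in base R, the same string idea might be sketched with gregexpr(). This is only an illustration under my own assumptions: the lookaround pattern is not from the answer above, and 0{1,9} is my reading of "fewer than 10 values". The toy vector is rebuilt without noise so that the result is reproducible.

```r
# Deterministic stand-in for the toy data (same NA pattern, no noise)
x <- seq(0, 10, length.out = 100)
y <- sin(x)
y[c(20:21, 50:54, 60:64, 90:91)] <- NA

# 1 = NA, 0 = value
na_string <- paste0(as.integer(is.na(y)), collapse = "")

# A run of 1 to 9 zeros directly preceded and directly followed by five ones.
# Lookarounds do not consume the ones, so an NA run can border two matches.
m <- gregexpr("(?<=1{5})0{1,9}(?=1{5})", na_string, perl = TRUE)[[1]]

if (m[1] != -1) {
  starts <- as.integer(m)
  lens <- attr(m, "match.length")
  idx <- unlist(Map(function(s, l) seq(s, length.out = l), starts, lens))
  y[idx] <- NA
}

which(is.na(y))
```

Because every character of na_string corresponds to one element of y, the match positions can be used directly as indices.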

Answer 3

Another rle possibility:

Calculate the run lengths of NA:

r <- rle(is.na(y))

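The values of the runs of non-NA (i.e. FALSE) that should be removed from the data (run length less than 10, and both preceded and followed by runs of NA longer than 4) are replaced with TRUE: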

r$values[!r$values & r$lengths < 10 &
           c(0, head(r$lengths, -1)) > 4 &
           c(tail(r$lengths, -1), 0) > 4] <- TRUE

Then use the updated rle values together with the lengths to create a logical index that replaces the relevant y values with NA:

y[rep(r$values, r$lengths)] <- NA
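For reuse, the steps above could be wrapped in a small function. The following is only a sketch: the name remove_enclosed_runs, its arguments and the test vector z are made up for illustration, and inverse.rle() is used in place of rep(r$values, r$lengths), which expands the runs in the same way.

```r
# Hypothetical helper: set short non-NA runs to NA when both neighbouring
# NA runs are longer than min_na
remove_enclosed_runs <- function(y, max_len = 10, min_na = 4) {
  r <- rle(is.na(y))
  prev_len <- c(0, head(r$lengths, -1))  # length of the run before each run
  next_len <- c(tail(r$lengths, -1), 0)  # length of the run after each run
  r$values[!r$values & r$lengths < max_len &
             prev_len > min_na & next_len > min_na] <- TRUE
  y[inverse.rle(r)] <- NA
  y
}

# Values 4:6 sit between two NA runs of length 5; values 7:9 do not
z <- c(1, 2, 3, rep(NA, 5), 4, 5, 6, rep(NA, 5), 7, 8, 9, NA, 10)
out <- remove_enclosed_runs(z)
which(!is.na(out))  # 1 2 3 17 18 19 21
```

Only the run at indices 9-11 is set to NA; the run at 17-19 is kept because the NA run after it has length 1. Since runs in an rle alternate, the neighbours of a non-NA run are always NA runs, which is why checking the neighbouring lengths is enough.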

The result can be plotted with the OP's plotting code.

Answer 4

Would complete.cases() work for you? This function makes all rows containing NA disappear. Maybe that is too drastic for you...
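As a quick check of what this does on the toy data, here is a small sketch using a noise-free stand-in for y with the same NA pattern (the name kept is made up for this illustration):

```r
# Deterministic stand-in for the toy data (same NA pattern, no noise)
x <- seq(0, 10, length.out = 100)
y <- sin(x)
y[c(20:21, 50:54, 60:64, 90:91)] <- NA
df <- data.frame(x, y)

# Keep only rows without any NA
kept <- df[complete.cases(df), ]
nrow(kept)                                  # 86
all(55:59 %in% as.integer(rownames(kept)))  # TRUE
```

The 14 NA rows are dropped, but rows 55 to 59 are still present, so on its own this does not remove the block the question asks about.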

Removing values surrounded by a certain number of NAs
