Replace missing values with previous values across multiple columns, by group

Time: 2017-10-09 11:11:17

Tags: r dplyr missing-data

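I have a data frame with 6 variables. Within each group, every column holds the same value, but some of those values are missing. I want to fill the missing values by copying, for each variable, the value from the same group. If all values of a particular group are missing, the value from the group above should be filled in. So I expect the result to be df_complete.

Here is what I tried, but it fails when the first observation of any group is missing. I can't figure out what is wrong with it.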

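library(dplyr)  # provides %>% and group_by() used below

set.seed(123)
df <- data.frame(matrix(rnorm(100), ncol = 5))
df$Group <- letters[1:20]
# Repeat each row 1 to 10 times so every group has several identical rows
df <- df[rep(seq_len(nrow(df)), sample(1:10, 20, replace = TRUE)), ]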
df_complete <- df
df$X1[sample(1:nrow(df), 15)] <- NA
df$X2[sample(1:nrow(df), 10)] <- NA
df$X3[sample(1:nrow(df), 25)] <- NA
df$X4[sample(1:nrow(df), 10)] <- NA
df$X5[sample(1:nrow(df), 15)] <- NA

lvcf <- function(x)
{
  miss_ind <- which(is.na(x))

  if(length(miss_ind) != 0)
  {
    if(miss_ind[1]==1)
    {
      ind1 <- which(!is.na(x))[1]
      x[1] <- x[ind1]
      miss_ind <- which(is.na(x))
    }

    for(i in 1:length(miss_ind))
    {
      x[miss_ind[i]] <- x[miss_ind[i]-1]
    }
  }      
  return(x)
}

df_complete <- df %>%
  group_by(Group) %>%
  sapply(lvcf)

1 answer:

Answer 0 (score: 2)

The zoo package has a function that handles this problem, na.locf (last observation carried forward).

library(zoo)

df_complete <- df %>% group_by(Group) %>% na.locf(., na.rm = FALSE)
head(df_complete)
## A tibble: 6 x 6
## Groups: Group [2]
#           X1          X2          X3          X4          X5 Group
#        <chr>       <chr>       <chr>       <chr>       <chr> <chr>
#1 -0.56047565 -1.06782371 -0.69470698        <NA> 0.005764186     a
#2 -0.56047565 -1.06782371 -0.69470698  0.37963948 0.005764186     a
#3 -0.56047565 -1.06782371 -0.69470698  0.37963948 0.005764186     a
#4 -0.23017749 -0.21797491 -0.20791728 -0.50232345 0.385280401     b
#5 -0.23017749 -0.21797491 -0.20791728 -0.50232345 0.385280401     b
#6 -0.23017749 -0.21797491 -0.20791728 -0.50232345 0.385280401     b

Note the <NA> in column X4.
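To see why that leading NA survives, it can help to run na.locf on a bare vector. This is a minimal sketch on a made-up vector x (not taken from the question's data); it assumes only that the zoo package is installed.

```r
library(zoo)

# Made-up example vector: one leading NA, two interior NAs, one trailing NA
x <- c(NA, 1, NA, NA, 2, NA)

# Carry the last observation forward; na.rm = FALSE keeps the leading NA,
# because there is no earlier value to copy into it
forward <- na.locf(x, na.rm = FALSE)

# A second pass with fromLast = TRUE carries the next observation backward,
# which fills whatever is still missing at the start
both <- na.locf(forward, fromLast = TRUE)

print(forward)
print(both)
```

Here forward still starts with NA, while both has no missing values left.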

Edit
Following the OP's comment below and G. Grothendieck's answer, the following removes all NA values. Just use a second na.locf with the argument fromLast = TRUE.

df_complete <- df %>% group_by(Group) %>%
  na.locf(., na.rm = FALSE) %>%
  na.locf(., fromLast = TRUE)
head(df_complete)
## A tibble: 6 x 6
## Groups: Group [2]
#           X1          X2          X3          X4          X5 Group
#        <chr>       <chr>       <chr>       <chr>       <chr> <chr>
#1 -0.56047565 -1.06782371 -0.69470698  0.37963948 0.005764186     a
#2 -0.56047565 -1.06782371 -0.69470698  0.37963948 0.005764186     a
#3 -0.56047565 -1.06782371 -0.69470698  0.37963948 0.005764186     a
#4 -0.23017749 -0.21797491 -0.20791728 -0.50232345 0.385280401     b
#5 -0.23017749 -0.21797491 -0.20791728 -0.50232345 0.385280401     b
#6 -0.23017749 -0.21797491 -0.20791728 -0.50232345 0.385280401     b

Edit 2
Following the bug found by the OP, here is a solution that uses only base R. I would create a new df with NA values at the start of each group except the first, i.e., group a.
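The sketch below is not the answer's own code. It is one hedged way to do the fill in base R, assuming the rows are already ordered by group: fill each column within its group first, then carry values down across groups for any group that is missing entirely. The helper name fill_down_up and the toy data frame are hypothetical.

```r
# Hypothetical helper: replace each NA with the last non-missing value,
# and replace leading NAs with the first non-missing value.
fill_down_up <- function(x) {
  ok <- !is.na(x)
  if (!any(ok)) return(x)   # nothing to copy from: leave the vector as is
  idx <- cumsum(ok)         # position of the last non-missing value seen so far
  idx[idx == 0] <- 1        # leading NAs take the first non-missing value
  x[ok][idx]
}

# Toy data (made up): group "b" has no X1 values, group "c" has no X2 values
toy <- data.frame(
  Group = c("a", "a", "b", "b", "c", "c"),
  X1    = c(NA, 1.5, NA, NA, 3.0, NA),
  X2    = c(0.2, NA, NA, 0.7, NA, NA)
)

value_cols <- setdiff(names(toy), "Group")
filled <- toy
for (col in value_cols) {
  # Step 1: fill within each group, so values never leak between groups
  within_group <- ave(toy[[col]], toy$Group, FUN = fill_down_up)
  # Step 2: groups that are still all NA take the value of the group above
  filled[[col]] <- fill_down_up(within_group)
}

print(filled)
```

Doing the within-group pass before the cross-group pass matters: a group whose first row is missing gets its own value rather than the previous group's.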