Determine missing values based on lagged grouped values

Time: 2017-06-05 19:23:13

Tags: r dplyr missing-data

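I need to fill in missing values by group, based on previous and/or following values. I would like to do this with dplyr (though data.table solutions are also welcome).

Sample data: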

library(dplyr)  # provides tibble()

testing <- tibble(key = c(10,10,10,10,10,10,20,20,20,20,20,20),
                  year = c(15,15,16,16,17,17,15,15,16,16,17,17),
                  name = c("abc","abc","","","dfg","dfg",
                          "","","nmm","nmm","",""),
                  is_name = c(1,1,0,0,1,1,0,0,0,0,0,0))

     key  year  name is_name
   <dbl> <dbl> <chr>   <dbl>
1     10    15   abc       1
2     10    15   abc       1
3     10    16             0
4     10    16             0
5     10    17   dfg       1
6     10    17   dfg       1
7     20    15             0
8     20    15             0
9     20    16   nmm       0
10    20    16   nmm       0
11    20    17             0
12    20    17             0

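I want to fill in the missing names (name): if the previous year of the same key is flagged is_name == 1, the missing name should be filled with that year's name. So the output could be: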

     key  year  name is_name name_new
   <dbl> <dbl> <chr>   <dbl>    <chr>
1     10    15   abc       1      abc
2     10    15   abc       1      abc
3     10    16             0      abc
4     10    16             0      abc
5     10    17   dfg       1      dfg
6     10    17   dfg       1      dfg
7     20    15             0         
8     20    15             0         
9     20    16   nmm       0      nmm
10    20    16   nmm       0      nmm
11    20    17             0         
12    20    17             0 

I tried using lag() and lead(), but they do not work properly across groups (key).

Thanks!

1 Answer:

Answer 0: (score: 1)

This might work for you:

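library(dplyr)
library(zoo)  # for na.locf()

testing <- testing %>%
  arrange(key, year) %>%
  # Recode blank names and zero flags as NA so na.locf() can carry values forward
  mutate(name = ifelse(name == "", NA, name),
         is_name = ifelse(is_name == 0, NA, is_name)) %>%
  group_by(key) %>%
  # Within each key, fill a missing name with the last non-missing name,
  # but only once a row with is_name == 1 has been seen in that key
  mutate(newname = ifelse(is.na(name) & na.locf(is_name, na.rm = FALSE) == 1,
                          na.locf(name, na.rm = FALSE),
                          name),
         # Restore the original 0 flags; unfilled names stay NA rather than ""
         is_name = ifelse(is.na(is_name), 0, is_name))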
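
One thing to be aware of: because is_name == 0 is recoded to NA before na.locf() is called, a flag of 1 is carried forward over every later row of the same key, not just the next year. If the rule should look exactly one step back, a dplyr-only variant using lag() might be an option. The following is only a sketch under a few assumptions that are not stated in the question: each (key, year) pair has a single name/is_name combination, "previous year" means the previous year present for that key, and the helper names prev_year, prev_name and prev_is_name are made up for illustration.

```r
library(dplyr)

testing <- tibble(
  key     = c(10, 10, 10, 10, 10, 10, 20, 20, 20, 20, 20, 20),
  year    = c(15, 15, 16, 16, 17, 17, 15, 15, 16, 16, 17, 17),
  name    = c("abc", "abc", "", "", "dfg", "dfg", "", "", "nmm", "nmm", "", ""),
  is_name = c(1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0)
)

# Collapse to one row per (key, year), then look one row back within each key.
# Assumption: name and is_name are constant within a (key, year) pair.
prev_year <- testing %>%
  distinct(key, year, name, is_name) %>%
  arrange(key, year) %>%
  group_by(key) %>%
  mutate(prev_name    = lag(name),
         prev_is_name = lag(is_name)) %>%
  ungroup() %>%
  select(key, year, prev_name, prev_is_name)

# Attach the previous year's values to every original row and fill a blank
# name only when the previous year exists and was flagged is_name == 1.
result <- testing %>%
  left_join(prev_year, by = c("key", "year")) %>%
  mutate(name_new = if_else(name == "" & !is.na(prev_is_name) & prev_is_name == 1,
                            prev_name,
                            name)) %>%
  select(-prev_name, -prev_is_name)

print(result)
```

Since lag() is called after group_by(key), it never reaches into the previous key, which is the part that tends to go wrong when lag() is applied to the ungrouped data. Unlike the zoo version, this keeps the blank names as "" instead of NA and leaves name and is_name untouched.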