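I'm trying to fill in missing values in a data frame, but I don't want every possible combination of variables. I only want to fill based on a grouping of three variables: coursecode, year, and week.
I've looked at complete() in the tidyr library, but even after reading "Using tidyr::complete with group_by" and https://blog.rstudio.org/2015/09/13/tidyr-0-3-0/, I can't get it to work.
I have observers who collect data for different courses in particular weeks of the year. For example, data might be collected in weeks 1-10 across my larger dataset, but I only care about the weeks that are missing within a specific course/year combination.
Example: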
library(dplyr)
library(tidyr)
df <- data.frame(coursecode = rep(c("A", "B"), each = 6),
year = rep(c(2000, 2000, 2000, 2001, 2001, 2001), 2),
week = c(1, 3, 4, 1, 2, 3, 2, 3, 5, 3, 4, 5),
values = c(1:12),
othervalues = c(12:23),
region = "Big")
df
coursecode year week values othervalues region
1 A 2000 1 1 12 Big
2 A 2000 3 2 13 Big
3 A 2000 4 3 14 Big
4 A 2001 1 4 15 Big
5 A 2001 2 5 16 Big
6 A 2001 3 6 17 Big
7 B 2000 2 7 18 Big
8 B 2000 3 8 19 Big
9 B 2000 5 9 20 Big
10 B 2001 3 10 21 Big
11 B 2001 4 11 22 Big
12 B 2001 5 12 23 Big
Attempt with complete() (not the output I want):
df %>%
complete(coursecode, year, region, nesting(week))
# A tibble: 20 x 6
coursecode year region week values othervalues
<fctr> <dbl> <fctr> <dbl> <int> <int>
1 A 2000 Big 1 1 12
2 A 2000 Big 2 NA NA
3 A 2000 Big 3 2 13
4 A 2000 Big 4 3 14
5 A 2000 Big 5 NA NA
6 A 2001 Big 1 4 15
7 A 2001 Big 2 5 16
8 A 2001 Big 3 6 17
9 A 2001 Big 4 NA NA
10 A 2001 Big 5 NA NA
11 B 2000 Big 1 NA NA
12 B 2000 Big 2 7 18
13 B 2000 Big 3 8 19
14 B 2000 Big 4 NA NA
15 B 2000 Big 5 9 20
16 B 2001 Big 1 NA NA
17 B 2001 Big 2 NA NA
18 B 2001 Big 3 10 21
19 B 2001 Big 4 11 22
20 B 2001 Big 5 12 23
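The 20 rows above follow from how complete() crosses the columns it is given: every distinct coursecode, year, and region is combined with every week that appears anywhere in the data. A minimal sketch of that arithmetic, using the same example data (the names n_expected and n_actual are only illustrative):

```r
library(dplyr)
library(tidyr)

df <- data.frame(coursecode = rep(c("A", "B"), each = 6),
                 year = rep(c(2000, 2000, 2000, 2001, 2001, 2001), 2),
                 week = c(1, 3, 4, 1, 2, 3, 2, 3, 5, 3, 4, 5),
                 values = c(1:12),
                 othervalues = c(12:23),
                 region = "Big")

# Distinct values per column: 2 courses, 2 years, 1 region, 5 weeks
n_expected <- n_distinct(df$coursecode) * n_distinct(df$year) *
  n_distinct(df$region) * n_distinct(df$week)

# Row count of the ungrouped complete() call from the question
n_actual <- df %>%
  complete(coursecode, year, region, nesting(week)) %>%
  nrow()

n_expected
n_actual
```

Both counts should come out the same, which is why weeks such as 5 for course A show up even though that course never ran past week 4.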
Desired output:
coursecode year region week values othervalues
<fctr> <dbl> <fctr> <dbl> <int> <int>
1 A 2000 Big 1 1 12
2 A 2000 Big 2 NA NA
3 A 2000 Big 3 2 13
4 A 2000 Big 4 3 14
5 A 2001 Big 1 4 15
6 A 2001 Big 2 5 16
7 A 2001 Big 3 6 17
8 B 2000 Big 2 7 18
9 B 2000 Big 3 8 19
10 B 2000 Big 4 NA NA
11 B 2000 Big 5 9 20
12 B 2001 Big 3 10 21
13 B 2001 Big 4 11 22
14 B 2001 Big 5 12 23
Answer 0 (score: 7)
We can try using expand and left_join:
library(dplyr)
library(tidyr)
df %>%
group_by(coursecode, year, region) %>%
expand(week = full_seq(week, 1)) %>%
left_join(df, by = c("coursecode", "year", "region", "week"))
# coursecode year region week values othervalues
# <fctr> <dbl> <fctr> <dbl> <int> <int>
#1 A 2000 Big 1 1 12
#2 A 2000 Big 2 NA NA
#3 A 2000 Big 3 2 13
#4 A 2000 Big 4 3 14
#5 A 2001 Big 1 4 15
#6 A 2001 Big 2 5 16
#7 A 2001 Big 3 6 17
#8 B 2000 Big 2 7 18
#9 B 2000 Big 3 8 19
#10 B 2000 Big 4 NA NA
#11 B 2000 Big 5 9 20
#12 B 2001 Big 3 10 21
#13 B 2001 Big 4 11 22
#14 B 2001 Big 5 12 23
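The key piece here is full_seq(), which fills the gaps between the smallest and largest value it is given. Because it runs once per group, each course/year only gets weeks inside its own observed range. A small sketch with the weeks of two groups from the example (the variable names are only illustrative):

```r
library(tidyr)

# Weeks observed for course A in 2000 and course B in 2000
weeks_a_2000 <- full_seq(c(1, 3, 4), 1)
weeks_b_2000 <- full_seq(c(2, 3, 5), 1)

weeks_a_2000  # 1 2 3 4
weeks_b_2000  # 2 3 4 5
```

Note that weeks before the first or after the last observed week of a group are not added, which matches the desired output in the question.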
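Answer 1 (score: 0)
Since the OP was already using complete() (which is itself a wrapper around expand() and a join), one can stick with it and save a line of code compared to @akrun's solution: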
# example data
df <- data.frame(coursecode = rep(c("A", "B"), each = 6),
year = rep(c(2000, 2000, 2000, 2001, 2001, 2001), 2),
week = c(1, 3, 4, 1, 2, 3, 2, 3, 5, 3, 4, 5),
values = c(1:12),
othervalues = c(12:23),
region = "Big")
# complete by group
library(dplyr)
library(tidyr)
df %>%
group_by(coursecode, year, region) %>%
complete(week = full_seq(week, 1))
#> # A tibble: 14 x 6
#> # Groups: coursecode, year, region [4]
#> coursecode year region week values othervalues
#> <chr> <dbl> <chr> <dbl> <int> <int>
#> 1 A 2000 Big 1 1 12
#> 2 A 2000 Big 2 NA NA
#> 3 A 2000 Big 3 2 13
#> 4 A 2000 Big 4 3 14
#> 5 A 2001 Big 1 4 15
#> 6 A 2001 Big 2 5 16
#> 7 A 2001 Big 3 6 17
#> 8 B 2000 Big 2 7 18
#> 9 B 2000 Big 3 8 19
#> 10 B 2000 Big 4 NA NA
#> 11 B 2000 Big 5 9 20
#> 12 B 2001 Big 3 10 21
#> 13 B 2001 Big 4 11 22
#> 14 B 2001 Big 5 12 23
Created on 2020-10-29 by the reprex package (v0.3.0)
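If the inserted rows should hold something other than NA, complete() also accepts a fill argument. The sketch below assumes, purely for illustration, that a missing week should count as 0 in both value columns. The question does not ask for this, so treat it as an optional variation. It also calls ungroup() so later steps do not run per group by accident.

```r
library(dplyr)
library(tidyr)

df <- data.frame(coursecode = rep(c("A", "B"), each = 6),
                 year = rep(c(2000, 2000, 2000, 2001, 2001, 2001), 2),
                 week = c(1, 3, 4, 1, 2, 3, 2, 3, 5, 3, 4, 5),
                 values = c(1:12),
                 othervalues = c(12:23),
                 region = "Big")

filled <- df %>%
  group_by(coursecode, year, region) %>%
  # Assumption: 0 is a sensible placeholder for weeks without data
  complete(week = full_seq(week, 1),
           fill = list(values = 0L, othervalues = 0L)) %>%
  ungroup()

filled
```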