I want to filter a data set that has the variables id, year, and value.
structure(list(id = c(
70101L, 70101L, 70101L, 70102L, 70102L,
70102L, 70102L, 70102L, 70103L, 70103L, 70103L, 70103L, 70103L,
70103L, 70104L, 70104L, 70104L, 70104L, 70104L, 70104L
), year = c(
2013L,
2014L, 2015L, 2013L, 2014L, 2015L, 2016L, 2017L, 2013L, 2014L,
2015L, 2016L, 2017L, 2018L, 2013L, 2014L, 2015L, 2016L, 2017L,
2018L
), value = c(
4.68, 4.76, 5.14, 4.48, 4.71, 4.24, 5.13, 5.22,
5.13, 5.05, 4.96, 5.09, 8.09, 7.82, 3.57, 7.96, 1.83, 4.56, 11,
10.6
)), row.names = c(NA, -20L), class = "data.frame")
I want to keep only the IDs that have complete information for every year from 2013 to 2018. Expected output:
   id      year value
   <fct>  <dbl> <dbl>
 1 070103  2013  5.13
 2 070103  2014  5.05
 3 070103  2015  4.96
 4 070103  2016  5.09
 5 070103  2017  8.09
 6 070103  2018  7.82
 7 070104  2013  3.57
 8 070104  2014  7.96
 9 070104  2015  1.83
10 070104  2016  4.56
11 070104  2017 11.0
12 070104  2018 10.6
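Answer 0 (score: 1)
This can be done as follows: group by id and keep a group only if every year from 2013 to 2018 appears in it.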
library(dplyr)
d <- structure(list(id = c(
70101L, 70101L, 70101L, 70102L, 70102L,
70102L, 70102L, 70102L, 70103L, 70103L, 70103L, 70103L, 70103L,
70103L, 70104L, 70104L, 70104L, 70104L, 70104L, 70104L
), year = c(
2013L,
2014L, 2015L, 2013L, 2014L, 2015L, 2016L, 2017L, 2013L, 2014L,
2015L, 2016L, 2017L, 2018L, 2013L, 2014L, 2015L, 2016L, 2017L,
2018L
), value = c(
4.68, 4.76, 5.14, 4.48, 4.71, 4.24, 5.13, 5.22,
5.13, 5.05, 4.96, 5.09, 8.09, 7.82, 3.57, 7.96, 1.83, 4.56, 11,
10.6
)), row.names = c(NA, -20L), class = "data.frame")
d %>%
  group_by(id) %>%
  # keep a group only if all six required years are present
  filter(all(c(2013:2018) %in% year))
#> # A tibble: 12 x 3
#> # Groups:   id [2]
#>       id  year value
#>    <int> <int> <dbl>
#>  1 70103  2013  5.13
#>  2 70103  2014  5.05
#>  3 70103  2015  4.96
#>  4 70103  2016  5.09
#>  5 70103  2017  8.09
#>  6 70103  2018  7.82
#>  7 70104  2013  3.57
#>  8 70104  2014  7.96
#>  9 70104  2015  1.83
#> 10 70104  2016  4.56
#> 11 70104  2017 11
#> 12 70104  2018 10.6
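To see why the filter condition works, it may help to look at the check on plain vectors. The sketch below uses two hypothetical year vectors (`years_complete` and `years_partial` are illustrative names, not part of the original data) and base R only.

```r
# Hypothetical year vectors for two ids (illustrative, not from the data above)
years_complete <- 2013:2018
years_partial  <- 2013:2017

required <- 2013:2018

# %in% returns one logical per required year
present_partial <- required %in% years_partial

# all() collapses those logicals into a single TRUE/FALSE per group
ok_complete <- all(required %in% years_complete)
ok_partial  <- all(present_partial)

print(present_partial)
print(ok_complete)
print(ok_partial)
```

Inside `filter()` on a grouped data frame, a single TRUE or FALSE like this is applied to every row of the group, which is why whole ids are kept or dropped.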
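Answer 1 (score: 1)
Another approach is to compute a helper variable that checks whether year forms such a consecutive sequence. Note that the index equals 1 plus the difference between the last and first year in each group, so this check assumes the years within each id are sorted and have no gaps: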
library(dplyr)
#Code
df <- df %>%
  group_by(id) %>%
  # Diff is the step between consecutive years; Index sums to 6 for 2013..2018
  mutate(Diff = c(1, diff(year)),
         Index = sum(Diff)) %>%
  filter(Index == 6) %>%
  select(-c(Index, Diff))
Output:
# A tibble: 12 x 3
# Groups:   id [2]
      id  year value
   <int> <int> <dbl>
 1 70103  2013  5.13
 2 70103  2014  5.05
 3 70103  2015  4.96
 4 70103  2016  5.09
 5 70103  2017  8.09
 6 70103  2018  7.82
 7 70104  2013  3.57
 8 70104  2014  7.96
 9 70104  2015  1.83
10 70104  2016  4.56
11 70104  2017 11
12 70104  2018 10.6
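One thing worth checking with a diff-based index is how it behaves when years are missing in the middle. The sketch below, in base R with hypothetical vectors (`gappy`, `full`, and the helper `is_complete` are illustrative names), shows that the sum telescopes to 1 plus the span between the first and last year, and contrasts it with a membership check.

```r
# Hypothetical year vectors: one with only the endpoints, one complete
gappy <- c(2013L, 2018L)
full  <- 2013:2018

# The diff-based index is 1 + (last year - first year), so both give 6
index_gappy <- sum(c(1, diff(gappy)))
index_full  <- sum(c(1, diff(full)))

# A stricter check that requires every year to be present (hypothetical helper)
is_complete <- function(year, required = 2013:2018) {
  all(required %in% year)
}

print(c(index_gappy, index_full))
print(is_complete(gappy))
print(is_complete(full))
```

For the sample data in this question there are no gaps, so both checks select the same ids; the membership check is simply the safer choice if gaps are possible.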
Data used:
#Data
df <- structure(list(id = c(70101L, 70101L, 70101L, 70102L, 70102L,
70102L, 70102L, 70102L, 70103L, 70103L, 70103L, 70103L, 70103L,
70103L, 70104L, 70104L, 70104L, 70104L, 70104L, 70104L), year = c(2013L,
2014L, 2015L, 2013L, 2014L, 2015L, 2016L, 2017L, 2013L, 2014L,
2015L, 2016L, 2017L, 2018L, 2013L, 2014L, 2015L, 2016L, 2017L,
2018L), value = c(4.68, 4.76, 5.14, 4.48, 4.71, 4.24, 5.13, 5.22,
5.13, 5.05, 4.96, 5.09, 8.09, 7.82, 3.57, 7.96, 1.83, 4.56, 11,
10.6)), row.names = c(NA, -20L), class = "data.frame")
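Answer 2 (score: 1)
With base R functions, you can split the data by id, keep the pieces that contain every year from 2013 to 2018, and bind them back together: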
# split once, then flag the ids whose years cover 2013..2018
groups <- split(df, df$id)
keep <- sapply(groups, function(x) all(2013:2018 %in% x$year))
new_df <- do.call("rbind", groups[keep])
rownames(new_df) <- NULL
new_df
#       id year value
# 1  70103 2013  5.13
# 2  70103 2014  5.05
# 3  70103 2015  4.96
# 4  70103 2016  5.09
# 5  70103 2017  8.09
# 6  70103 2018  7.82
# 7  70104 2013  3.57
# 8  70104 2014  7.96
# 9  70104 2015  1.83
# 10 70104 2016  4.56
# 11 70104 2017 11.00
# 12 70104 2018 10.60
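A possible base R variant avoids splitting and re-binding the data frame by computing one flag per id with `tapply()` and then subsetting rows. This is only a sketch on a small hypothetical data frame (`toy` and its values are made up for illustration), not the method from the answer above.

```r
# Hypothetical toy data: id 1 has three years, id 2 has all six
toy <- data.frame(
  id    = c(rep(1L, 3), rep(2L, 6)),
  year  = c(2013:2015, 2013:2018),
  value = c(4.1, 4.2, 4.3, 5.1, 5.2, 5.3, 5.4, 5.5, 5.6)
)

# One logical per id: TRUE if every year from 2013 to 2018 is present
complete <- tapply(toy$year, toy$id, function(y) all(2013:2018 %in% y))

# tapply() names its result by id, so pick the names flagged TRUE
ids_ok <- names(complete)[complete]

# Keep the rows whose id is in that set (ids compared as character)
result <- toy[as.character(toy$id) %in% ids_ok, ]
rownames(result) <- NULL
print(result)
```

Row order is preserved here because the original data frame is subset in place rather than rebuilt from pieces.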