I have a repeated-measures dataset for the following subjects:
# Data
subject <- c("A", "A", "A", "B", "B", "B", "B", "C", "C", "D", "D", "D", "E", "E", "E", "E", "E")
testDate <- c(2012, 2013, 2014, 2013, 2014, 2015, 2016, 2012, 2013, 2016, 2017, 2018, 2012, 2013, 2014, 2015, 2016)
var1 <- rnorm(n = length(subject), mean = 0, sd = 1)
df <- data.frame(subject, testDate, var1)
df
subject testDate var1
1 A 2012 1.2405521
2 A 2013 -1.0518959
3 A 2014 -0.5830443
4 B 2013 1.3279566
5 B 2014 -0.5353911
6 B 2015 -0.3799306
7 B 2016 1.7456606
8 C 2012 -1.1573785
9 C 2013 0.9105006
10 D 2016 0.4129998
11 D 2017 0.4370711
12 D 2018 -0.1956156
13 E 2012 -0.1618883
14 E 2013 0.2141332
15 E 2014 -0.1341796
16 E 2015 0.0115121
17 E 2016 0.6919945
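I want to keep all of the data for subjects who meet a particular condition. For example, I want all repeated measures only for those subjects whose first test was in 2012.
I created a per-subject testID, as follows: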
# create test index
library(tidyverse)
df <- df %>%
group_by(subject) %>%
mutate(testID = seq_along(subject))
df
# A tibble: 17 x 4
# Groups: subject [5]
subject testDate var1 testID
<fct> <dbl> <dbl> <int>
1 A 2012 1.24 1
2 A 2013 -1.05 2
3 A 2014 -0.583 3
4 B 2013 1.33 1
5 B 2014 -0.535 2
6 B 2015 -0.380 3
7 B 2016 1.75 4
8 C 2012 -1.16 1
9 C 2013 0.911 2
10 D 2016 0.413 1
11 D 2017 0.437 2
12 D 2018 -0.196 3
13 E 2012 -0.162 1
14 E 2013 0.214 2
15 E 2014 -0.134 3
16 E 2015 0.0115 4
17 E 2016 0.692 5
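Note that seq_along(subject) numbers rows in the order they appear, so testID is only a true test number when the rows are already sorted by testDate within each subject. A minimal sketch, assuming the row order is not guaranteed (the shuffled data frame and the names shuffled and indexed are hypothetical, used only for illustration), that sorts first and then uses row_number():

```r
library(dplyr)

# Same subjects and dates as above
subject  <- c("A", "A", "A", "B", "B", "B", "B", "C", "C",
              "D", "D", "D", "E", "E", "E", "E", "E")
testDate <- c(2012, 2013, 2014, 2013, 2014, 2015, 2016, 2012, 2013,
              2016, 2017, 2018, 2012, 2013, 2014, 2015, 2016)

# Assumption: rows arrive in arbitrary order, simulated here by shuffling
set.seed(1)
shuffled <- data.frame(subject, testDate)[sample(length(subject)), ]

indexed <- shuffled %>%
  arrange(subject, testDate) %>%     # sort so the index follows calendar order
  group_by(subject) %>%
  mutate(testID = row_number()) %>%  # 1 = earliest test for that subject
  ungroup()
```

With the data sorted this way, testID == 1 always marks each subject's earliest test, regardless of how the rows were originally ordered.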
Normally I would just find the matching subjects and then filter on them, like this:
# get those who had their first test in 2012
names <- df %>% filter(testID == 1 & testDate == 2012) %>% pull(subject)
# keep only the rows for those subjects
df %>% filter(subject %in% names)
# A tibble: 10 x 4
# Groups: subject [3]
subject testDate var1 testID
<fct> <dbl> <dbl> <int>
1 A 2012 1.24 1
2 A 2013 -1.05 2
3 A 2014 -0.583 3
4 C 2012 -1.16 1
5 C 2013 0.911 2
6 E 2012 -0.162 1
7 E 2013 0.214 2
8 E 2014 -0.134 3
9 E 2015 0.0115 4
10 E 2016 0.692 5
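I would like to know whether there is a faster or tidier way to do this in fewer lines of code. My real dataset is very large, so streamlining the process would probably make it more efficient.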
Answer 0 (score: 3)
One option could be:
df %>%
group_by(subject) %>%
filter(min(testDate) == 2012)
subject testDate var1
<chr> <dbl> <dbl>
1 A 2012 -1.48
2 A 2013 1.58
3 A 2014 -0.957
4 C 2012 -0.628
5 C 2013 -0.106
6 E 2012 -0.780
7 E 2013 0.0120
8 E 2014 -0.152
9 E 2015 -0.703
10 E 2016 1.19
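As a quick check, the sketch below compares this one-step grouped filter with the two-step approach from the question on the same data. The seed and the names ids, two_step and one_step are assumptions added for illustration; they are not part of the original post.

```r
library(dplyr)

set.seed(42)  # assumption: seed added here only so the example is reproducible
df <- data.frame(
  subject  = c("A", "A", "A", "B", "B", "B", "B", "C", "C",
               "D", "D", "D", "E", "E", "E", "E", "E"),
  testDate = c(2012, 2013, 2014, 2013, 2014, 2015, 2016, 2012, 2013,
               2016, 2017, 2018, 2012, 2013, 2014, 2015, 2016),
  stringsAsFactors = FALSE
)
df$var1 <- rnorm(nrow(df))

# Two-step approach from the question: index the tests, collect subjects, filter
ids <- df %>%
  group_by(subject) %>%
  mutate(testID = seq_along(subject)) %>%
  ungroup() %>%
  filter(testID == 1 & testDate == 2012) %>%
  pull(subject)
two_step <- df %>% filter(subject %in% ids)

# One-step grouped filter: keep groups whose earliest testDate is 2012
one_step <- df %>%
  group_by(subject) %>%
  filter(min(testDate) == 2012) %>%
  ungroup()

# Both approaches keep the same 10 rows (subjects A, C and E)
identical(one_step$subject, two_step$subject)
```

Because min(testDate) is evaluated once per group, this version needs no testID column and does not depend on the rows being sorted by date.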
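Answer 1 (score: 2)
We can use match to find the index of 2012 and, after grouping by subject, check inside filter whether that index equals 1. For a subject with no 2012 test, match returns NA, which filter treats as FALSE.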
library(dplyr)
df %>%
group_by(subject) %>%
filter(match(2012, testDate)==1)
# A tibble: 10 x 3
# Groups: subject [3]
# subject testDate var1
# <fct> <dbl> <dbl>
# 1 A 2012 -1.19
# 2 A 2013 0.972
# 3 A 2014 0.595
# 4 C 2012 -0.165
# 5 C 2013 0.860
# 6 E 2012 -0.705
# 7 E 2013 1.21
# 8 E 2014 0.500
# 9 E 2015 -0.766
#10 E 2016 -0.757
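One caveat worth illustrating: match reports a row position, so this approach assumes each subject's rows are sorted by testDate. The sketch below uses a small hypothetical data frame (not the data from the question) in which subject A is deliberately out of order, and compares the result with a min-based filter.

```r
library(dplyr)

# Hypothetical toy data: subject A's rows are deliberately out of date order
toy <- data.frame(
  subject  = c("A", "A", "B", "B", "C"),
  testDate = c(2013, 2012, 2012, 2013, 2014),
  stringsAsFactors = FALSE
)

# match() gives the position of the first 2012 within each group:
# A -> 2, B -> 1, C -> NA (no 2012 test), so only B passes the == 1 check
by_match <- toy %>%
  group_by(subject) %>%
  filter(match(2012, testDate) == 1) %>%
  ungroup()

# min() ignores row order, so both A and B are kept
by_min <- toy %>%
  group_by(subject) %>%
  filter(min(testDate) == 2012) %>%
  ungroup()

unique(by_match$subject)
unique(by_min$subject)
```

If the data cannot be assumed sorted, either arrange by subject and testDate before the match-based filter or compare against min(testDate) instead.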