A tidy way to filter repeated measures of subjects on a specific condition?

Time: 2019-12-22 15:50:05

Tags: r tidyverse

I have a repeated-measures dataset for the following subjects:

# Data
subject <- c("A", "A", "A", "B", "B", "B", "B", "C", "C", "D", "D", "D", "E", "E", "E", "E", "E")
testDate <- c(2012, 2013, 2014, 2013, 2014, 2015, 2016, 2012, 2013, 2016, 2017, 2018, 2012, 2013, 2014, 2015, 2016)
var1 <- rnorm(n = length(subject), mean = 0, sd = 1)

df <- data.frame(subject, testDate, var1)
df

  subject testDate       var1
1        A     2012  1.2405521
2        A     2013 -1.0518959
3        A     2014 -0.5830443
4        B     2013  1.3279566
5        B     2014 -0.5353911
6        B     2015 -0.3799306
7        B     2016  1.7456606
8        C     2012 -1.1573785
9        C     2013  0.9105006
10       D     2016  0.4129998
11       D     2017  0.4370711
12       D     2018 -0.1956156
13       E     2012 -0.1618883
14       E     2013  0.2141332
15       E     2014 -0.1341796
16       E     2015  0.0115121
17       E     2016  0.6919945

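I want to keep all of the data for subjects who meet a specific condition. For example, I want every repeated measure, but only for subjects whose first test was in 2012.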

I created a per-subject testID as follows:

# create test index

library(tidyverse)

df <- df %>%
  group_by(subject) %>%
  mutate(testID = seq_along(subject))

df

# A tibble: 17 x 4
# Groups:   subject [5]
   subject testDate    var1 testID
   <fct>      <dbl>   <dbl>  <int>
 1 A           2012  1.24        1
 2 A           2013 -1.05        2
 3 A           2014 -0.583       3
 4 B           2013  1.33        1
 5 B           2014 -0.535       2
 6 B           2015 -0.380       3
 7 B           2016  1.75        4
 8 C           2012 -1.16        1
 9 C           2013  0.911       2
10 D           2016  0.413       1
11 D           2017  0.437       2
12 D           2018 -0.196       3
13 E           2012 -0.162       1
14 E           2013  0.214       2
15 E           2014 -0.134       3
16 E           2015  0.0115      4
17 E           2016  0.692       5

I would normally just find the matching subjects and then filter on them, like so:

# get those who had their first test in 2012

names <- df %>% filter(testID == 1 & testDate == 2012) %>% pull(subject)

# keep only those subjects

df %>% filter(subject %in% names)

# A tibble: 10 x 4
# Groups:   subject [3]
   subject testDate    var1 testID
   <fct>      <dbl>   <dbl>  <int>
 1 A           2012  1.24        1
 2 A           2013 -1.05        2
 3 A           2014 -0.583       3
 4 C           2012 -1.16        1
 5 C           2013  0.911       2
 6 E           2012 -0.162       1
 7 E           2013  0.214       2
 8 E           2014 -0.134       3
 9 E           2015  0.0115      4
10 E           2016  0.692       5

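I am wondering whether there is a faster/tidier way to do this in fewer lines of code. I have a huge dataset, so shortening the process might also make it more efficient.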

2 Answers:

Answer 0: (score: 3)

One option could be:

df %>%
 group_by(subject) %>%
 filter(min(testDate) == 2012)

   subject testDate    var1
   <chr>      <dbl>   <dbl>
 1 A           2012 -1.48  
 2 A           2013  1.58  
 3 A           2014 -0.957 
 4 C           2012 -0.628 
 5 C           2013 -0.106 
 6 E           2012 -0.780 
 7 E           2013  0.0120
 8 E           2014 -0.152 
 9 E           2015 -0.703 
10 E           2016  1.19 
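
As a self-contained check, the sketch below rebuilds the question's data and compares the grouped min() filter with the two-step approach from the question. The seed and the names res_min, first_2012 and res_two_step are my own additions for illustration. Since min() does not depend on row order, this version should also work when the rows are not sorted by date within each subject.

```r
library(dplyr)

set.seed(1)  # assumption: seed added only to make the sketch reproducible
subject  <- c("A", "A", "A", "B", "B", "B", "B", "C", "C",
              "D", "D", "D", "E", "E", "E", "E", "E")
testDate <- c(2012, 2013, 2014, 2013, 2014, 2015, 2016, 2012, 2013,
              2016, 2017, 2018, 2012, 2013, 2014, 2015, 2016)
df <- data.frame(subject, testDate, var1 = rnorm(length(subject)))

# Keep every row of a subject whose earliest test year is 2012
res_min <- df %>%
  group_by(subject) %>%
  filter(min(testDate) == 2012) %>%
  ungroup()

# Two-step approach from the question, for comparison
first_2012 <- df %>%
  group_by(subject) %>%
  filter(row_number() == 1, testDate == 2012) %>%
  pull(subject)
res_two_step <- df %>% filter(subject %in% first_2012)

nrow(res_min)                            # 10
unique(as.character(res_min$subject))    # "A" "C" "E"
nrow(res_two_step)                       # 10
```

The ungroup() at the end is optional; it only stops later steps from silently operating per subject.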

Answer 1: (score: 2)

We can use match to find the index of 2012 in testDate and, after grouping by 'subject', check in filter whether that index is equal to 1.

library(dplyr)
df %>%
    group_by(subject) %>%
    filter(match(2012, testDate)==1)
# A tibble: 10 x 3
# Groups:   subject [3]
#   subject testDate   var1
#   <fct>      <dbl>  <dbl>
# 1 A           2012 -1.19 
# 2 A           2013  0.972
# 3 A           2014  0.595
# 4 C           2012 -0.165
# 5 C           2013  0.860
# 6 E           2012 -0.705
# 7 E           2013  1.21 
# 8 E           2014  0.500
# 9 E           2015 -0.766
#10 E           2016 -0.757
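
To make the behaviour of match more visible, here is a minimal sketch on the question's data (var1 is left out because it plays no role in the filter). The names pos and res_match are hypothetical. match returns the position of the first 2012 within each subject's rows, or NA when the subject has no 2012 row, and filter drops rows where the condition is NA. The approach relies on the rows being ordered by testDate within each subject, so the sketch sorts first; that arrange step is my assumption about unsorted input, not part of the original answer.

```r
library(dplyr)

subject  <- c("A", "A", "A", "B", "B", "B", "B", "C", "C",
              "D", "D", "D", "E", "E", "E", "E", "E")
testDate <- c(2012, 2013, 2014, 2013, 2014, 2015, 2016, 2012, 2013,
              2016, 2017, 2018, 2012, 2013, 2014, 2015, 2016)
df <- data.frame(subject, testDate)

# Position of 2012 within each subject's rows (NA if the subject has no 2012 row)
pos <- df %>%
  group_by(subject) %>%
  summarise(pos_2012 = match(2012, testDate))
pos$pos_2012   # 1 NA 1 NA 1

# Sort by date so that "position 1" means "earliest test", then filter
res_match <- df %>%
  arrange(subject, testDate) %>%
  group_by(subject) %>%
  filter(match(2012, testDate) == 1) %>%
  ungroup()

nrow(res_match)   # 10
```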