Question

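Problem:

I have a dataset with multiple grouping variables. For each group, I need to select the rows starting from the first instance where the variable value is >= 5.

The raw data looks like this:

    id type time value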
 1:  1    1    1 1.002
 2:  1    1    2 4.019
 3:  1    1    3 5.048
 4:  1    1    4 6.005
 5:  1    1    5 4.108
 6:  1    1    6 3.509
 7:  1    2    1 2.104
 8:  1    2    2 6.001
 9:  1    2    3 5.903
10:  1    2    4 5.025
11:  1    2    5 3.907
12:  1    2    6 4.569
13:  5    1    1 4.006
14:  5    1    2 4.019
15:  5    1    3 4.908
16:  5    1    4 6.001
17:  5    1    5 4.199
18:  5    1    6 4.999
19:  5    2    1 0.009
20:  5    2    2 2.093
21:  5    2    3 3.081
22:  5    2    4 4.014
23:  5    2    5 4.998
24:  5    2    6 5.041

Possible solution:

To use the accepted dplyr answer in this question, I added a logical variable to help select the rows and then applied a filter:

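library(dplyr)

# Flag the rows where value reaches the threshold
sample.dt$state <- FALSE
sample.dt$state[sample.dt$value >= 5] <- TRUE

# Within each group, keep every row from the first flagged row onward
sample.dt %>%
  group_by(id, type) %>%
  filter(cumsum(state) > 0)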

This does give me what I need:

       id  type  time value state
    <dbl> <dbl> <dbl> <dbl> <lgl>
 1      1     1     3 5.048  TRUE
 2      1     1     4 6.005  TRUE
 3      1     1     5 4.108 FALSE
 4      1     1     6 3.509 FALSE
 5      1     2     2 6.001  TRUE
 6      1     2     3 5.903  TRUE
 7      1     2     4 5.025  TRUE
 8      1     2     5 3.907 FALSE
 9      1     2     6 4.569 FALSE
 10     5     1     4 6.001  TRUE
 11     5     1     5 4.199 FALSE
 12     5     1     6 4.999 FALSE
 13     5     2     6 5.041  TRUE

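Question:

What is a better or more direct way to do this? Because I will be applying it to a very large dataset with more nested grouping variables, I would rather not have to create a logical variable to do this.

Sample data: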

library(data.table)

sample.dt <- data.table(id = c(1,1,1,1,1,1,1,1,1,1,1,1,5,5,5,5,5,5,5,5,5,5,5,5),
                        type = c(1,1,1,1,1,1,2,2,2,2,2,2,1,1,1,1,1,1,2,2,2,2,2,2),
                        time = c(1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6),
                        value = c(1.002,4.019,5.048,6.005,4.108,3.509,
                                  2.104,6.001,5.903,5.025,3.907,4.569,
                                  4.006,4.019,4.908,6.001,4.199,4.999,
                                  0.009,2.093,3.081,4.014,4.998,5.041))

Answer 1

Since the initial dataset is a data.table, we can use data.table methods:

sample.dt[, .SD[cumsum(value >=5) > 0] , by = .(id, type)]

A faster approach is to extract the row indices (.I) and subset with them:

sample.dt[sample.dt[, .I[cumsum(value >=5) > 0] , by = .(id, type)]$V1]
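
To see why the condition cumsum(value >= 5) > 0 keeps everything from the first qualifying row onward, here is a minimal base R sketch on the values of a single group (id 1, type 1 from the sample data). The intermediate names hit, running and keep are only illustrative and do not appear in the answer's code.

```r
# Values of one group (id = 1, type = 1) from the sample data
value <- c(1.002, 4.019, 5.048, 6.005, 4.108, 3.509)

hit     <- value >= 5    # TRUE where the threshold is reached
running <- cumsum(hit)   # stays 0 until the first TRUE, then never drops back to 0
keep    <- running > 0   # FALSE before the first hit, TRUE from the first hit onward

value[keep]              # rows from the first value >= 5 to the end of the group
```

Because the comparison is computed inside the expression, no helper column is needed. The same expression should also work directly inside dplyr's filter() after group_by(id, type).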

Answer 2

> sample.dt$var=ifelse(sample.dt$value>=5,TRUE,FALSE)
> sample.dt
    id type time value   var
 1:  1    1    1 1.002 FALSE
 2:  1    1    2 4.019 FALSE
 3:  1    1    3 5.048  TRUE
 4:  1    1    4 6.005  TRUE
 5:  1    1    5 4.108 FALSE
 6:  1    1    6 3.509 FALSE
 7:  1    2    1 2.104 FALSE
 8:  1    2    2 6.001  TRUE
 9:  1    2    3 5.903  TRUE
10:  1    2    4 5.025  TRUE
11:  1    2    5 3.907 FALSE
12:  1    2    6 4.569 FALSE
13:  5    1    1 4.006 FALSE
14:  5    1    2 4.019 FALSE
15:  5    1    3 4.908 FALSE
16:  5    1    4 6.001  TRUE
17:  5    1    5 4.199 FALSE
18:  5    1    6 4.999 FALSE
19:  5    2    1 0.009 FALSE
20:  5    2    2 2.093 FALSE
21:  5    2    3 3.081 FALSE
22:  5    2    4 4.014 FALSE
23:  5    2    5 4.998 FALSE
24:  5    2    6 5.041  TRUE

> min(which(sample.dt$var== TRUE))
[1] 3


> sample.dt[min(which(sample.dt$var== TRUE)),,]
   id type time value  var
1:  1    1    3 5.048 TRUE

Or simply:

> sample.dt[min(which(ifelse(sample.dt$value>=5,TRUE,FALSE)== TRUE)),,]
   id type time value  var
1:  1    1    3 5.048 TRUE
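
Note that the calls above return a single row for the whole table, not one result per group. A hedged sketch of how the same "first position where value >= 5" idea might be applied per (id, type) group with data.table follows. The result names from_first and first_only are illustrative assumptions, and the sample data is rebuilt so the block runs on its own.

```r
library(data.table)

sample.dt <- data.table(
  id    = c(1,1,1,1,1,1,1,1,1,1,1,1,5,5,5,5,5,5,5,5,5,5,5,5),
  type  = c(1,1,1,1,1,1,2,2,2,2,2,2,1,1,1,1,1,1,2,2,2,2,2,2),
  time  = c(1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6),
  value = c(1.002,4.019,5.048,6.005,4.108,3.509,
            2.104,6.001,5.903,5.025,3.907,4.569,
            4.006,4.019,4.908,6.001,4.199,4.999,
            0.009,2.093,3.081,4.014,4.998,5.041)
)

# Per group: find the position of the first value >= 5 and keep that row and
# all later rows. If a group never reaches 5, nomatch = .N + 1L selects nothing.
from_first <- sample.dt[
  , .SD[seq_len(.N) >= match(TRUE, value >= 5, nomatch = .N + 1L)],
  by = .(id, type)
]

# Per group: only the first row with value >= 5
first_only <- sample.dt[value >= 5, .SD[1L], by = .(id, type)]
```

With the sample data, from_first should contain the same 13 rows as the dplyr result in the question, and first_only should contain one row per group.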

Select rows from the first instance in a group
