Subset a data.table by group, up to and including the row where a condition is met

Time: 2017-07-13 09:30:49

Tags: r data.table subset

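I want to subset a data.table as follows: grouping by id and group, keep the rows from row 1 of each group through the row where a condition is met. That is, if the condition is met at row 3 of a group, I want to keep rows 1, 2 and 3 of that group.

Example data: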

    id time group
 1:  1    0     1
 2:  1   20     1
 3:  1    0     2
 4:  1   40     2
 5:  2    0     1
 6:  2   35     1
 7:  2   50     1
 8:  3    0     1
 9:  3   10     1
10:  3   20     1
11:  3    0     2
12:  3   25     2
13:  3   45     2

The condition is time > 30, so the expected result would be:

    id time group
 1:  1    0     2
 2:  1   40     2
 3:  2    0     1
 4:  2   35     1
 5:  3    0     2
 6:  3   25     2
 7:  3   45     2

I tried: df[1:which(time > 30)[1], .SD, by = .(id, group)]

But it returns:

   id group time
1:  1     1    0
2:  1     1   20
3:  1     2    0
4:  1     2   40

Data:

structure(list(id = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3),
               time = c(0, 20, 0, 40, 0, 35, 50, 0, 10, 20, 0, 25, 45),
               group = c(1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 2, 2, 2)),
          .Names = c("id", "time", "group"),
          row.names = c(NA, -13L),
          class = c("data.table", "data.frame"))

Update: here is what akrun's answer gives me on another dataset, compared with the behavior I expect.

Data:

> dftest
     patientid groupe arret dateConsult lag_dateConsult temps abst temps_cum
1: 0303H233457      2     1  2011-10-05            <NA>     0    1         0
2: 0303H233457      2     1  2011-11-09      2011-10-05    35    1        35
3: 0303H233457      2     1  2011-12-21      2011-11-09    42    1        77
4: 0303H233457      2     1  2012-01-30      2011-12-21    40    1       117
5: 0303H233457      2     1  2012-04-18      2012-01-30    79    1       196
6: 0303H233457      2     1  2012-08-27      2012-04-18   131    1       327
7: 0303H233457      4     1  2012-11-19            <NA>     0    1         0
8: 0303H233457      4     1  2013-01-07      2012-11-19    49    1        49

What I get:

> dftest[dftest[, .I[seq(which(temps_cum > 30))], .(patientid, groupe)]$V1]
     patientid groupe arret dateConsult lag_dateConsult temps abst temps_cum
1: 0303H233457      2     1  2011-10-05            <NA>     0    1         0
2: 0303H233457      2     1  2011-11-09      2011-10-05    35    1        35
3: 0303H233457      2     1  2011-12-21      2011-11-09    42    1        77
4: 0303H233457      2     1  2012-01-30      2011-12-21    40    1       117
5: 0303H233457      2     1  2012-04-18      2012-01-30    79    1       196
6: 0303H233457      4     1  2012-11-19            <NA>     0    1         0
7: 0303H233457      4     1  2013-01-07      2012-11-19    49    1        49

Expected result:

     patientid groupe arret dateConsult lag_dateConsult temps abst temps_cum
1: 0303H233457      2     1  2011-10-05            <NA>     0    1         0
2: 0303H233457      2     1  2011-11-09      2011-10-05    35    1        35
3: 0303H233457      4     1  2012-11-19            <NA>     0    1         0
4: 0303H233457      4     1  2013-01-07      2012-11-19    49    1        49

Data:

structure(list(patientid = c("0303H233457", "0303H233457", "0303H233457",
                             "0303H233457", "0303H233457", "0303H233457",
                             "0303H233457", "0303H233457"),
               groupe = c(2, 2, 2, 2, 2, 2, 4, 4),
               arret = c(1, 1, 1, 1, 1, 1, 1, 1),
               dateConsult = structure(c(15252, 15287, 15329, 15369,
                                         15448, 15579, 15663, 15712),
                                       class = "Date"),
               lag_dateConsult = structure(c(NA, 15252, 15287, 15329,
                                             15369, 15448, NA, 15663),
                                           class = "Date"),
               temps = c(0, 35, 42, 40, 79, 131, 0, 49),
               abst = c(1, 1, 1, 1, 1, 1, 1, 1),
               temps_cum = c(0, 35, 77, 117, 196, 327, 0, 49)),
          .Names = c("patientid", "groupe", "arret", "dateConsult",
                     "lag_dateConsult", "temps", "abst", "temps_cum"),
          class = c("data.table", "data.frame"),
          row.names = c(NA, -8L))

1 Answer:

Answer 0: (score: 1)

After grouping by 'id' and 'group', get the row indices (.I) up to the row where 'time' is greater than 30, and use them to subset the rows:

df1[df1[, .I[seq(which(time > 30))], .(id, group)]$V1]
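
A caveat on the line above, added as a hedged sketch rather than as part of the original answer: when more than one row in a group satisfies the condition, which() returns several positions, and seq() applied to a vector of length greater than one behaves like seq_along(). The number of rows kept then equals the number of matches instead of the position of the first match, which is what the question's update runs into. One possible way around this, assuming the goal is to stop at the first match, is match(TRUE, ...) with nomatch = 0L. The names idx and res are illustrative only.

```r
library(data.table)

# Second dataset from the question, reduced to the columns that matter here.
dftest <- data.table(
  patientid = "0303H233457",
  groupe    = c(2, 2, 2, 2, 2, 2, 4, 4),
  temps_cum = c(0, 35, 77, 117, 196, 327, 0, 49)
)

# match(TRUE, cond, nomatch = 0L) is the position of the first TRUE within
# the group, or 0 when the condition is never met. seq_len(0) is empty, so
# groups without any match contribute no rows.
idx <- dftest[, .I[seq_len(match(TRUE, temps_cum > 30, nomatch = 0L))],
              by = .(patientid, groupe)]$V1

res <- dftest[idx]
print(res)
```

On this reduced table, idx selects rows 1, 2, 7 and 8, that is, the first two rows of each groupe, which is the expected result shown in the question's update.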

If we instead need everything up to the last row where 'time' is greater than 30:

df1[df1[, .I[seq(tail(which(time > 30), 1))], .(id, group)]$V1]
#   id time group
#1:  1    0     2
#2:  1   40     2
#3:  2    0     1
#4:  2   35     1
#5:  2   50     1
#6:  3    0     2
#7:  3   25     2
#8:  3   45     2
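
For completeness, a sketch of why the attempt in the question returns only four rows: in DT[i, j, by], i is evaluated on the whole table before any grouping, so which(time > 30)[1] is the first matching row overall (row 4), and by is then applied to rows 1:4 only. Moving the row selection into j makes it run once per group. The names attempt and per_group are illustrative, and the .SD form is just one possible alternative to the .I form used above.

```r
library(data.table)

df <- data.table(
  id    = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3),
  time  = c(0, 20, 0, 40, 0, 35, 50, 0, 10, 20, 0, 25, 45),
  group = c(1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 2, 2, 2)
)

# i runs first, on the full table: rows 1:4 are selected, then grouped.
attempt <- df[1:which(time > 30)[1], .SD, by = .(id, group)]

# Selecting rows inside j is done per (id, group) group instead.
# Groups where the condition is never met yield zero rows and drop out.
per_group <- df[, .SD[seq_len(match(TRUE, time > 30, nomatch = 0L))],
                by = .(id, group)]
print(per_group)
```

Here attempt has the four rows shown in the question, while per_group has the seven rows of the expected result.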