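I want to subset a data.table as follows: grouping by id and group, keep the rows from row 1 up to the row where a condition is first met. For example, if the condition is met at row 3, I want to keep rows 1, 2 and 3.
Example data: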
    id time group
 1:  1    0     1
 2:  1   20     1
 3:  1    0     2
 4:  1   40     2
 5:  2    0     1
 6:  2   35     1
 7:  2   50     1
 8:  3    0     1
 9:  3   10     1
10:  3   20     1
11:  3    0     2
12:  3   25     2
13:  3   45     2
The condition is: time > 30
So the expected result would be:
   id time group
1:  1    0     2
2:  1   40     2
3:  2    0     1
4:  2   35     1
5:  3    0     2
6:  3   25     2
7:  3   45     2
I tried: df[1:which(time > 30)[1], .SD, by = .(id, group)]
But it returns:
   id group time
1:  1     1    0
2:  1     1   20
3:  1     2    0
4:  1     2   40
Data:
structure(list(id = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3),
time = c(0, 20, 0, 40, 0, 35, 50, 0, 10, 20, 0, 25, 45),
group = c(1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 2, 2, 2)), .Names = c("id",
"time", "group"), row.names = c(NA, -13L), class = c("data.table",
"data.frame"))
Update: what akrun's answer returns on another dataset, compared with the behavior I expected.
Data:
> dftest
     patientid groupe arret dateConsult lag_dateConsult temps abst temps_cum
1: 0303H233457      2     1  2011-10-05            <NA>     0    1         0
2: 0303H233457      2     1  2011-11-09      2011-10-05    35    1        35
3: 0303H233457      2     1  2011-12-21      2011-11-09    42    1        77
4: 0303H233457      2     1  2012-01-30      2011-12-21    40    1       117
5: 0303H233457      2     1  2012-04-18      2012-01-30    79    1       196
6: 0303H233457      2     1  2012-08-27      2012-04-18   131    1       327
7: 0303H233457      4     1  2012-11-19            <NA>     0    1         0
8: 0303H233457      4     1  2013-01-07      2012-11-19    49    1        49
What I got:
> dftest[dftest[, .I[seq(which(temps_cum > 30))], .(patientid, groupe)]$V1]
     patientid groupe arret dateConsult lag_dateConsult temps abst temps_cum
1: 0303H233457      2     1  2011-10-05            <NA>     0    1         0
2: 0303H233457      2     1  2011-11-09      2011-10-05    35    1        35
3: 0303H233457      2     1  2011-12-21      2011-11-09    42    1        77
4: 0303H233457      2     1  2012-01-30      2011-12-21    40    1       117
5: 0303H233457      2     1  2012-04-18      2012-01-30    79    1       196
6: 0303H233457      4     1  2012-11-19            <NA>     0    1         0
7: 0303H233457      4     1  2013-01-07      2012-11-19    49    1        49
Expected result:
     patientid groupe arret dateConsult lag_dateConsult temps abst temps_cum
1: 0303H233457      2     1  2011-10-05            <NA>     0    1         0
2: 0303H233457      2     1  2011-11-09      2011-10-05    35    1        35
3: 0303H233457      4     1  2012-11-19            <NA>     0    1         0
4: 0303H233457      4     1  2013-01-07      2012-11-19    49    1        49
Data:
structure(list(patientid = c("0303H233457", "0303H233457", "0303H233457",
"0303H233457", "0303H233457", "0303H233457", "0303H233457", "0303H233457"
), groupe = c(2, 2, 2, 2, 2, 2, 4, 4), arret = c(1, 1, 1, 1,
1, 1, 1, 1), dateConsult = structure(c(15252, 15287, 15329, 15369,
15448, 15579, 15663, 15712), class = "Date"), lag_dateConsult = structure(c(NA,
15252, 15287, 15329, 15369, 15448, NA, 15663), class = "Date"),
temps = c(0, 35, 42, 40, 79, 131, 0, 49), abst = c(1, 1,
1, 1, 1, 1, 1, 1), temps_cum = c(0, 35, 77, 117, 196, 327,
0, 49)), .Names = c("patientid", "groupe", "arret", "dateConsult",
"lag_dateConsult", "temps", "abst", "temps_cum"), class = c("data.table",
"data.frame"), row.names = c(NA, -8L))
Answer 0 (score: 1)
After grouping by 'id' and 'group', get the row indices (.I) up to where 'time' is greater than 30, and use them to subset the rows:
df1[df1[, .I[seq(which(time > 30))], .(id, group)]$V1]
If we instead need the rows up to the last row where 'time' is greater than 30:
df1[df1[, .I[seq(tail(which(time > 30), 1))], .(id, group)]$V1]
#   id time group
#1:  1    0     2
#2:  1   40     2
#3:  2    0     1
#4:  2   35     1
#5:  2   50     1
#6:  3    0     2
#7:  3   25     2
#8:  3   45     2
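One thing to watch with seq(which(time > 30)): when seq() gets a single vector of length greater than one, it behaves like seq_along(), so it returns as many rows as there are matches rather than stopping at the first match. That appears to be what the second dataset in the question runs into. Below is a minimal sketch, assuming the goal is to stop at the first row where time > 30 and to drop groups where the condition is never met (as in the expected output of the question). The names idx and res are only illustrative.

```r
library(data.table)

# Example data from the question
df1 <- data.table(
  id    = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3),
  time  = c(0, 20, 0, 40, 0, 35, 50, 0, 10, 20, 0, 25, 45),
  group = c(1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 2, 2, 2)
)

# seq() on a vector of length > 1 acts like seq_along(): it counts the
# matches instead of locating the first one
seq(c(2L, 3L, 4L, 5L, 6L))
# returns 1 2 3 4 5

# match(TRUE, cond, nomatch = 0L) is the position of the first TRUE within
# the group, or 0 when the condition is never met. seq_len(0) is empty, so
# groups without a match contribute no rows.
idx <- df1[, .I[seq_len(match(TRUE, time > 30, nomatch = 0L))],
           by = .(id, group)]$V1

# Subset the original table with the collected row indices
res <- df1[idx]
res
```

The same pattern with temps_cum > 30 and by = .(patientid, groupe) should keep rows 1, 2, 7 and 8 of dftest, which is the expected result shown in the update.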