I have a data.table, say dt:
library(data.table)
name <- c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v")
score <- c(42, 82, 43, 32, 47, 48, 49, 50, 54, 59, 76, 09, 13, 88, 91, 99, 04, 06, 08, 12, 14, 15)
class <- c("c1", "c1", "c1", "c1", "c1", "c1", "c1", "c2", "c2", "c2", "c3", "c3", "c3", "c3", "c3", "c3", "c3", "c3", "c3", "c3", "c3", "c3")
dt <- data.table(name, score, class)
It looks like this:
> dt
    name score class
 1:    a    42    c1
 2:    b    82    c1
 3:    c    43    c1
 4:    d    32    c1
 5:    e    47    c1
 6:    f    48    c1
 7:    g    49    c1
 8:    h    50    c2
 9:    i    54    c2
10:    j    59    c2
11:    k    76    c3
12:    l     9    c3
13:    m    13    c3
14:    n    88    c3
15:    o    91    c3
16:    p    99    c3
17:    q     4    c3
18:    r     6    c3
19:    s     8    c3
20:    t    12    c3
21:    u    14    c3
22:    v    15    c3
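I need only those records that follow a monotonic (increasing) sequence of scores within each class. In this case, that means only the records with scores 42, 43, 47, 48, 49 for class c1 and the records with scores 50, 54, 59 for class c2.
In class "c3" the records to keep are those with scores 76, 88, 91, 99, 04, 06, 08, 12, 14, 15. Here the sequence reaches its maximum value (99) and then restarts. The scores 09 and 13 in class "c3" do not fit the monotonic sequence, so they need to be removed.
For each of the classes c1, c2, and c3, I want to remove the records whose score is out of sequence. There are 1 million records in total.
For a given class, there can be at most 3 consecutive out-of-sequence scores.
The final output must be as follows.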
> dt
    name score class
 1:    a    42    c1
 2:    c    43    c1
 3:    e    47    c1
 4:    f    48    c1
 5:    g    49    c1
 6:    h    50    c2
 7:    i    54    c2
 8:    j    59    c2
 9:    k    76    c3
10:    n    88    c3
11:    o    91    c3
12:    p    99    c3
13:    q     4    c3
14:    r     6    c3
15:    s     8    c3
16:    t    12    c3
17:    u    14    c3
18:    v    15    c3
To find the monotonic sequence, I tried:
dt <- dt[, .SD[score == cummax(score)],class]
But this also removes the sequence that restarts after the maximum value has been reached. How can I do this?
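A minimal sketch of the failure, using only the c3 scores from the example (the names c3_score, running_max and on_track are introduced here purely for illustration): once 99 has been seen, cummax stays at 99, so every score in the restarted run fails the comparison.

```r
# c3 scores from the example above
c3_score <- c(76, 9, 13, 88, 91, 99, 4, 6, 8, 12, 14, 15)

# Running maximum: it never drops again after the peak of 99
running_max <- cummax(c3_score)

# TRUE only where the score equals the running maximum
on_track <- c3_score == running_max

# Scores that survive the filter; the restarted run 4, 6, 8, ... is lost
c3_score[on_track]
```

Only 76, 88, 91 and 99 survive, which is the behaviour described above.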
Answer 0 (score: 3)
The cummax idea is very good; you only need a small modification:
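# Flag rows that do not fall below the running maximum; grouping by rleid(score == 99) resets the maximum at each 99
dt[, keep := score >= cummax(shift(score, fill = first(score))),
   by = .(class, rleid(score == 99))]
dt[(keep)]  # the rows to retain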
Or, perhaps a better approach:
dt[dt[, .I[score == cummax(score)], by = list(class, rleid(score == 99))]$V1]
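The following is a self-contained sketch of how the second variant behaves on the sample data, assuming the data.table package is installed. The column name run and the variables idx and res are introduced here for illustration only. rleid(score == 99) starts a new run each time the comparison flips, so each class is split into the part before, at, and after a 99, and cummax is evaluated separately in each part. Note that the reset value 99 is hard-coded, which is an assumption taken from the example.

```r
library(data.table)

# Sample data from the question
name  <- letters[1:22]
score <- c(42, 82, 43, 32, 47, 48, 49, 50, 54, 59,
           76, 9, 13, 88, 91, 99, 4, 6, 8, 12, 14, 15)
class <- rep(c("c1", "c2", "c3"), times = c(7, 3, 12))
dt <- data.table(name, score, class)

# Run id that changes whenever score == 99 flips (99 is the assumed maximum)
dt[, run := rleid(score == 99)]

# Row numbers whose score equals the running maximum within (class, run)
idx <- dt[, .I[score == cummax(score)], by = .(class, run)]$V1

# Subset the original table by those row numbers
res <- dt[idx]
print(res)
```

On this data the c3 rows come out as 76, 88, 91, 99, 4, 6, 8, 12, 14, 15, matching the desired output. For c1, however, the early 82 becomes the running maximum, so this sketch keeps 42 and 82 rather than the 42, 43, 47, 48, 49 listed in the question; an outlier that is higher than the following scores is not handled by a cummax comparison alone.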