I have a data.table dt with three columns: nm, seqn and obj.
> library(data.table)
> nm <- letters[1:22]
> seqn <- c(32, 36, 86, 45, 47, 48, 49,
+ 52, 54, 59,
+ 66, 9, 69, 74, 81, 88, 90, 91, 93, 94, 95, 97)
> obj <- rep(c('c1', 'c2', 'c3'), c(7, 3, 12))
> dt <- data.table(nm, seqn, obj)
> dt
nm seqn obj
1: a 32 c1
2: b 36 c1
3: c 86 c1
4: d 45 c1
5: e 47 c1
6: f 48 c1
7: g 49 c1
8: h 52 c2
9: i 54 c2
10: j 59 c2
11: k 66 c3
12: l 9 c3
13: m 69 c3
14: n 74 c3
15: o 81 c3
16: p 88 c3
17: q 90 c3
18: r 91 c3
19: s 93 c3
20: t 94 c3
21: u 95 c3
22: v 97 c3
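I want to get a monotonic seqn sequence for each obj group. For obj "c1", I would like to delete a seqn like 86 (record 3); here 86 is a big number within an otherwise monotonic series of smaller seqn values. For obj "c3", I want to delete seqn 9 (record 12); here 9 is a small number within an otherwise monotonic series of bigger seqn values.
How can I do this using data.table or a data.frame?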
Answer 0 (score: 3)
Here is another data.table solution, which differs from the one suggested in this comment.
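The OP has asked to get a monotonic seqn sequence for each obj group. In addition, the OP has detailed that a larger number needs to be deleted when it is preceded and followed by smaller numbers, and that a smaller number needs to be deleted when it is preceded and followed by larger numbers. Although not stated explicitly, it can be concluded from the provided data that the OP is referring to a monotonically increasing sequence.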
library(data.table)
DT[-DT[, .I[which(xor(
shift(seqn) < shift(seqn, type = "lead"),
between(seqn, shift(seqn), shift(seqn, type = "lead"))
))], by = obj]$V1]
# nm seqn obj
# 1: a 32 c1
# 2: b 36 c1
# 3: d 45 c1
# 4: e 47 c1
# 5: f 48 c1
# 6: g 49 c1
# 7: h 52 c2
# 8: i 54 c2
# 9: j 59 c2
#10: k 66 c3
#11: m 69 c3
#12: n 74 c3
#13: o 81 c3
#14: p 88 c3
#15: q 90 c3
#16: r 91 c3
#17: s 93 c3
#18: t 94 c3
#19: u 95 c3
#20: v 97 c3
Data used above:
library(data.table)
nm <- letters[1:22]
seqn <- c(32,36, 86,45 , 47, 48, 49, 52, 54, 59,
66, 9, 69, 74, 81, 88, 90, 91, 93, 94, 95, 97)
obj <- rep(c('c1', 'c2', 'c3'), c(7, 3, 12))
DT <- data.table(nm, seqn, obj)
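To see why the xor() condition singles out the outlier, it may help to evaluate its two halves separately on a single group. The following is only an illustrative sketch on a plain vector holding the seqn values of group c1; the variable names (x, prev, nxt, neighbours_increase, fits_between, flagged) are made up here and do not appear in the answer's code.

```r
library(data.table)

# seqn values of group c1 from the sample data
x <- c(32, 36, 86, 45, 47, 48, 49)

prev <- shift(x)                 # previous neighbour (NA for the first element)
nxt  <- shift(x, type = "lead")  # next neighbour (NA for the last element)

neighbours_increase <- prev < nxt       # are the two neighbours in ascending order?
fits_between <- between(x, prev, nxt)   # does x lie within [prev, nxt]?

# An element is flagged when exactly one of the two conditions holds
flagged <- xor(neighbours_increase, fits_between)

# Side-by-side view of the intermediate results
data.table(x, prev, nxt, neighbours_increase, fits_between, flagged)

which(flagged)      # 3, i.e. the position of 86
x[-which(flagged)]  # the sequence without the flagged element
```

For the neighbours 36 and 45 the first condition holds, but 86 does not lie between them, so exactly one condition is true and position 3 is flagged. The first and the last element each have an NA neighbour, so xor() yields NA for them and which() silently skips them. That is why this version never removes an element at either end of a group.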
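The approach above can be enhanced to also cover the edge cases where monotonicity is violated at the beginning or at the end of a sequence, i.e., of an obj group.
For example: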
seqn <- c(32, 36, 86, 45, 47, 48, 49, 52, 54, 59,
66, 9, 13, 74, 81, 88, 90, 91, 93, 94, 95, 11)
(DT <- data.table(nm, seqn, obj))
# nm seqn obj
# 1: a 32 c1
# 2: b 36 c1
# 3: c 86 c1
# 4: d 45 c1
# 5: e 47 c1
# 6: f 48 c1
# 7: g 49 c1
# 8: h 52 c2
# 9: i 54 c2
#10: j 59 c2
#11: k 66 c3
#12: l 9 c3
#13: m 13 c3
#14: n 74 c3
#15: o 81 c3
#16: p 88 c3
#17: q 90 c3
#18: r 91 c3
#19: s 93 c3
#20: t 94 c3
#21: u 95 c3
#22: v 11 c3
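Note that rows 13 and 22 of DT have been changed. Now, the first and the last element of obj group c3 have become "outliers". The first element 66 is greater than the next two elements 9 and 13, and the last element 11 is lower than the preceding element 95. So, the monotonically increasing sequence starts with 9 and ends with 95, and the elements 66 and 11 have to be removed.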
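This is achieved by simply padding each sequence with a leading -Inf and a trailing +Inf. Apart from having to shift the result back in order to pick the correct row numbers, no other changes to the code are required: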
DT[-DT[, {seqn <- c(-Inf, seqn, +Inf); .I[which(shift(xor(
shift(seqn) < shift(seqn, type = "lead"),
between(seqn, shift(seqn), shift(seqn, type = "lead"))
), type = "lead"))]}, by = obj]$V1]
# nm seqn obj
# 1: a 32 c1
# 2: b 36 c1
# 3: d 45 c1
# 4: e 47 c1
# 5: f 48 c1
# 6: g 49 c1
# 7: h 52 c2
# 8: i 54 c2
# 9: j 59 c2
#10: l 9 c3
#11: m 13 c3
#12: n 74 c3
#13: o 81 c3
#14: p 88 c3
#15: q 90 c3
#16: r 91 c3
#17: s 93 c3
#18: t 94 c3
#19: u 95 c3
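As a quick sanity check, the padded approach can be wrapped in a small helper and the result tested for monotonicity per group with base R's is.unsorted(). This is only a sketch under stated assumptions: the names is_outlier, drop_isolated_outliers and the temporary column outlier are hypothetical and not part of the answer above, and the helper assumes that outliers are isolated single values, as in the sample data.

```r
library(data.table)

nm   <- letters[1:22]
seqn <- c(32, 36, 86, 45, 47, 48, 49, 52, 54, 59,
          66, 9, 13, 74, 81, 88, 90, 91, 93, 94, 95, 11)
obj  <- rep(c('c1', 'c2', 'c3'), c(7, 3, 12))
DT   <- data.table(nm, seqn, obj)

# Hypothetical helper: TRUE where an element breaks the increasing order
# relative to its two neighbours (sequence padded with -Inf and +Inf)
is_outlier <- function(x) {
  p    <- c(-Inf, x, +Inf)
  prev <- shift(p)
  nxt  <- shift(p, type = "lead")
  flagged <- xor(prev < nxt, between(p, prev, nxt))
  # keep only the positions that belong to x; NA counts as "not flagged"
  flagged[seq_along(x) + 1L] %in% TRUE
}

# Hypothetical wrapper: works on a copy so the input is left untouched
drop_isolated_outliers <- function(dat) {
  out <- copy(dat)
  out[, outlier := is_outlier(seqn), by = obj]
  out <- out[outlier == FALSE]
  out[, outlier := NULL]
  out[]
}

res <- drop_isolated_outliers(DT)

# is.unsorted() returns FALSE for a non-decreasing vector,
# so monotonic is TRUE for every group that is now in order
res[, .(monotonic = !is.unsorted(seqn)), by = obj]
```

Filtering on a logical column instead of negative row numbers is a deliberate choice in this sketch: a negative index built from an empty selection would drop every row when no outlier is found. The check is also useful because runs of two or more consecutive outliers are not caught by a neighbour-based test, and a group with such a run would still be reported as not monotonic.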