I want to split a data.table into groups based on the continuity of one variable.
Say, starting from this data.table:
DT <- data.table(Var1 = c(1:5, 7:10))
I would like it grouped like this:
# Var1 group
# 1: 1 1 # 1 to 5 is continuous with a maximal difference of 1
# 2: 2 1
# 3: 3 1
# 4: 4 1
# 5: 5 1
# 6: 7 2 # 7 to 10 is continuous again
# 7: 8 2
# 8: 9 2
# 9: 10 2
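The difference in Var1 should not be fixed to the one in this minimal example but adjustable, so that with a maximal difference of 2, DT <- data.table(Var1 = c(seq(1,10, 2), seq(13,30, 2))) would also be split into two groups.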
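Edit:
I should clarify that a 'maximal difference' of 2 or more means that differences in Var1 smaller than 2 should be considered 'continuous'. Also, the variable Var1 should not be restricted to integer values. The last point could be handled by multiplying, e.g., 0.14 by 100 to get 14 and multiplying the 'maximal difference' by 100 as well.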
Answer 0 (score: 3)
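# Relies on a step of 1 within each group: the running product stays constant
# until a difference other than 1 appears, which starts a new run id.
DT[, group := rleid(cumprod(c(1, diff(Var1))))]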
# Var1 group
#1: 1 1
#2: 2 1
#3: 3 1
#4: 4 1
#5: 5 1
#6: 7 2
#7: 8 2
#8: 9 2
#9: 10 2
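step <- 2  # expected difference between neighbours within a group
DT <- data.table(Var1 = c(seq(1,10, 2), seq(13,30, 2)))
# A new group starts wherever the difference is not equal to step.
DT[, group := rleid(cumsum(c(FALSE, diff(Var1) != step)))]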
# Var1 group
# 1: 1 1
# 2: 3 1
# 3: 5 1
# 4: 7 1
# 5: 9 1
# 6: 13 2
# 7: 15 2
# 8: 17 2
# 9: 19 2
#10: 21 2
#11: 23 2
#12: 25 2
#13: 27 2
#14: 29 2
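The line above starts a new group whenever a difference is not exactly equal to step. For the "maximal difference" wording in the question, a threshold comparison may be closer to what is asked. The following is a minimal sketch, not part of the original answer: group_by_gap and max_diff are hypothetical names, the input is assumed to be sorted with no NA values, and a difference exactly equal to max_diff is assumed to count as continuous (which matches the seq(..., 2) example).

```r
# Hypothetical helper: start a new group whenever the gap to the previous
# value exceeds max_diff. Assumes x is sorted increasingly and has no NAs.
group_by_gap <- function(x, max_diff = 1) {
  if (length(x) == 0L) return(integer(0))
  # The first element never opens a new group, hence the leading FALSE.
  # cumsum() counts the breaks seen so far; adding 1 makes ids start at 1.
  cumsum(c(FALSE, diff(x) > max_diff)) + 1L
}

# The two examples from the question.
g1 <- group_by_gap(c(1:5, 7:10), max_diff = 1)
g2 <- group_by_gap(c(seq(1, 10, 2), seq(13, 30, 2)), max_diff = 2)

# Two gaps of the same size, and non-integer values.
g3 <- group_by_gap(c(1:3, 5:7, 9:11), max_diff = 1)
g4 <- group_by_gap(c(0.1, 0.2, 0.9, 1.0), max_diff = 0.5)
```

With a data.table this could be applied as DT[, group := group_by_gap(Var1, max_diff = 2)]. If differences equal to the threshold should already break a group, as the stricter reading of the edit suggests, replace > with >=.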
Answer 1 (score: 1)
A base R solution.
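foo <- function(x){
  # Positions of differences that occur exactly once in diff(x);
  # each of them is treated as the break between two groups.
  # This assumes at least one gap exists and every gap size is unique.
  gr <- which(!(duplicated(diff(x)) | duplicated(diff(x), fromLast = TRUE)))
  if(length(gr) == 1){
    cbind(Var1=x,group=rep(1:(length(gr)+1), c(min(gr),length(x)-max(gr))))
  }else{
    cbind(Var1=x,group=rep(1:(length(gr)+1), c(min(gr), diff(gr),length(x)-max(gr))))
  }
}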
It works with different step sizes.
foo(c(seq(1,10, 2), seq(13,30, 2)))
Var1 group
[1,] 1 1
[2,] 3 1
[3,] 5 1
[4,] 7 1
[5,] 9 1
[6,] 13 2
[7,] 15 2
[8,] 17 2
[9,] 19 2
[10,] 21 2
[11,] 23 2
[12,] 25 2
[13,] 27 2
[14,] 29 2
Three groups work as well.
foo(c(1:5, 7:10, 13:20))
Var1 group
[1,] 1 1
[2,] 2 1
[3,] 3 1
[4,] 4 1
[5,] 5 1
[6,] 7 2
[7,] 8 2
[8,] 9 2
[9,] 10 2
[10,] 13 3
[11,] 14 3
[12,] 15 3
[13,] 16 3
[14,] 17 3
[15,] 18 3
[16,] 19 3
[17,] 20 3
For a data.table, you could try:
foo <- function(x){
  # Same logic as above, but returns only the vector of group ids.
  gr <- which(!(duplicated(diff(x)) | duplicated(diff(x), fromLast = TRUE)))
  if(length(gr) == 1){
    rep(1:(length(gr)+1), c(min(gr),length(x)-max(gr)))
  }else{
    rep(1:(length(gr)+1), c(min(gr), diff(gr),length(x)-max(gr)))
  }
}
DT[, group := foo(Var1)]