Grouping data by the continuity of a variable

Time: 2017-03-15 10:04:12

Tags: r data.table

I want to split a data.table into groups based on the continuity of one variable. Say, starting from the data.table

library(data.table)
DT <- data.table(Var1 = c(1:5, 7:10))

I would like it to be grouped like this:

#    Var1 group
# 1:    1     1 # 1 to 5 is continuous with a maximal difference of 1
# 2:    2     1
# 3:    3     1
# 4:    4     1
# 5:    5     1
# 6:    7     2 # 7 to 10 is continuous again
# 7:    8     2
# 8:    9     2
# 9:   10     2

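The difference in Var1 should not be fixed to the one in this minimal example but should be adjustable, so that, given a maximal difference of 2, DT <- data.table(Var1 = c(seq(1,10, 2), seq(13,30, 2))) would also be split into two groups.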

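Edit: I should clarify that a 'maximal difference' of 2 (or more) means that differences in Var1 smaller than that should be regarded as 'continuous' as well. In addition, Var1 should not be restricted to integer values. The latter can be worked around by multiplying, e.g., 0.14 by 100 to get 14, and also multiplying the 'maximal difference' by 100.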
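The scaling workaround from the edit could look roughly like this. It is only a sketch: `x`, `max_diff`, and `scale_factor` are names made up for illustration, and `round()` is added on the assumption that the scaled values are meant to be whole numbers, since a product such as `0.14 * 100` is not guaranteed to be exactly 14 in floating point.

```r
# Hypothetical non-integer input with one gap larger than max_diff
x <- c(0.10, 0.12, 0.14, 0.30, 0.32)
max_diff <- 0.02
scale_factor <- 100  # assumed: enough to turn two decimals into whole numbers

# Scale both the values and the threshold, rounding away floating-point noise
x_scaled <- round(x * scale_factor)
max_diff_scaled <- round(max_diff * scale_factor)

# A new group starts wherever the scaled gap exceeds the scaled threshold
group <- cumsum(c(TRUE, diff(x_scaled) > max_diff_scaled))
```

Here the first three values end up in one group and the last two in another.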

2 Answers:

Answer 0 (score: 3)

# Within a run the difference is 1, so the running product only changes at a gap
DT[, group := rleid(cumprod(c(1, diff(Var1))))]
#   Var1 group
#1:    1     1
#2:    2     1
#3:    3     1
#4:    4     1
#5:    5     1
#6:    7     2
#7:    8     2
#8:    9     2
#9:   10     2
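To make the one-liner easier to follow, here is a sketch of its intermediate values in base R. The names `x`, `d`, `p`, and `r` are introduced here for illustration, and the `rle()` lines are only a stand-in for what `rleid()` does, not data.table's own implementation. The trick relies on the within-group difference being exactly 1, because multiplying by 1 leaves the running product unchanged.

```r
x <- c(1:5, 7:10)

d <- diff(x)           # 1 1 1 1 2 1 1 1
p <- cumprod(c(1, d))  # 1 1 1 1 1 2 2 2 2

# Base R stand-in for rleid(): number the runs of equal values
r <- rle(p)
group <- rep(seq_along(r$lengths), r$lengths)
```

With a within-group step other than 1 the product changes at every element, which is why the second call below uses a different test.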

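step <- 2  # expected difference between neighbouring values within a group
DT <- data.table(Var1 = c(seq(1,10, 2), seq(13,30, 2)))
# A new group starts wherever the difference is not exactly `step`
DT[, group := rleid(cumsum(c(FALSE, diff(Var1) != step)))]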
#    Var1 group
# 1:    1     1
# 2:    3     1
# 3:    5     1
# 4:    7     1
# 5:    9     1
# 6:   13     2
# 7:   15     2
# 8:   17     2
# 9:   19     2
#10:   21     2
#11:   23     2
#12:   25     2
#13:   27     2
#14:   29     2
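The calls above test for an exact step. If differences up to a maximum should count as continuous, as the question's edit asks, a threshold comparison is one option. The following is a minimal sketch and not part of the original answer: `group_by_gap`, `max_diff`, and `tol` are hypothetical names, and the small tolerance is an assumption meant to absorb floating-point error for non-integer values.

```r
group_by_gap <- function(x, max_diff = 1, tol = 1e-8) {
  if (length(x) == 0L) return(integer(0))
  # TRUE marks the first element of each group: the very first value,
  # and every value whose distance to its predecessor exceeds max_diff
  starts <- c(TRUE, diff(x) > max_diff + tol)
  cumsum(starts)
}

g1 <- group_by_gap(c(1:5, 7:10))                         # step 1, one gap of 2
g2 <- group_by_gap(c(seq(1, 10, 2), seq(13, 30, 2)), 2)  # step 2, one gap of 4
g3 <- group_by_gap(c(0.10, 0.12, 0.14, 0.30), 0.02)      # non-integer values
```

On a data.table this could be applied as `DT[, group := group_by_gap(Var1, max_diff = 2)]`, assuming `Var1` is sorted in increasing order.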

Answer 1 (score: 1)

A base R solution.

foo <- function(x){
 # positions of differences that occur only once; these are treated as group breaks
 gr <- which(!(duplicated(diff(x)) | duplicated(diff(x), fromLast = TRUE)))
 if(length(gr) == 1){
   cbind(Var1=x,group=rep(1:(length(gr)+1), c(min(gr),length(x)-max(gr))))
 }else{
   cbind(Var1=x,group=rep(1:(length(gr)+1), c(min(gr), diff(gr),length(x)-max(gr))))
 }
}

It works with different step sizes:

foo(c(seq(1,10, 2), seq(13,30, 2)))
      Var1 group
 [1,]    1     1
 [2,]    3     1
 [3,]    5     1
 [4,]    7     1
 [5,]    9     1
 [6,]   13     2
 [7,]   15     2
 [8,]   17     2
 [9,]   19     2
[10,]   21     2
[11,]   23     2
[12,]   25     2
[13,]   27     2
[14,]   29     2

Three groups work as well:

foo(c(1:5, 7:10, 13:20))
      Var1 group
 [1,]    1     1
 [2,]    2     1
 [3,]    3     1
 [4,]    4     1
 [5,]    5     1
 [6,]    7     2
 [7,]    8     2
 [8,]    9     2
 [9,]   10     2
[10,]   13     3
[11,]   14     3
[12,]   15     3
[13,]   16     3
[14,]   17     3
[15,]   18     3
[16,]   19     3
[17,]   20     3
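To see why this works, it may help to look at the intermediate values for the three-group input. The following sketch simply replays the body of `foo` step by step (`d`, `sizes`, and `group` are names introduced here). Note that the approach relies on every gap between groups having a size that occurs only once among the differences; if two gaps were equal, or there were no gap at all, `gr` would miss them.

```r
x <- c(1:5, 7:10, 13:20)
d <- diff(x)  # 1 1 1 1 2 1 1 1 3 1 1 1 1 1 1 1

# Positions of differences that occur exactly once in d
gr <- which(!(duplicated(d) | duplicated(d, fromLast = TRUE)))  # 5 9

# Group sizes: up to the first break, between breaks, after the last break
sizes <- c(min(gr), diff(gr), length(x) - max(gr))  # 5 4 8
group <- rep(seq_along(sizes), sizes)
```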

For data.table, you can try:

foo <- function(x){
 # same logic as above, but returns only the group vector
 gr <- which(!(duplicated(diff(x)) | duplicated(diff(x), fromLast = TRUE)))
 if(length(gr) == 1){
   rep(1:(length(gr)+1), c(min(gr),length(x)-max(gr)))
 }else{
   rep(1:(length(gr)+1), c(min(gr), diff(gr),length(x)-max(gr)))
 }
}
DT[, group := foo(Var1)]