data.table: a "group counter" for a specific combination of columns

Time: 2016-06-16 16:57:34

Tags: r data.table

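I want to add a counter column to a data frame based on sets of identical rows. For this I am using the data.table package. In my case, rows must be compared on the combination of column "z" AND ("x" OR "y").

I tried: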

DF[ , Index := .GRP, by = c("x","y","z") ]

But the result groups on the combination "z" AND "x" AND "y".

How can I group on "z" AND ("x" OR "y")?

Here is a data example:

DF = data.frame(x=c("a","a","a","b","c","d","e","f","f"), y=c(1,3,2,8,8,4,4,6,0), z=c("M","M","M","F","F","M","M","F","F"))
DF <- data.table(DF)

I would like to get this output:

> DF
   x y z Index
1: a 1 M   1
2: a 3 M   1
3: a 2 M   1
4: b 8 F   2
5: c 8 F   2
6: d 4 M   3
7: e 4 M   3
8: f 6 F   4
9: f 0 F   4
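
For comparison, here is a minimal sketch of what the attempted call does on the sample data. Grouping by all three columns gives two rows the same Index only when x, y and z all match, and every row in the sample is a distinct (x, y, z) triple, so each row ends up in its own group.

```r
library(data.table)

DF <- data.table(x = c("a","a","a","b","c","d","e","f","f"),
                 y = c(1,3,2,8,8,4,4,6,0),
                 z = c("M","M","M","F","F","M","M","F","F"))

# .GRP numbers the groups defined by `by`; here a group is one (x, y, z) triple
DF[, Index := .GRP, by = c("x", "y", "z")]

# No two rows share all three values, so Index simply counts the rows
DF$Index
```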

4 answers:

Answer 0 (score: 6)

A new group starts if the value of z changes, or if the values of x and y both change.

Try this example.

require(data.table)

DF <- data.table(x = c("a","a","a","b","c","d","e","f","f"),
                 y = c(1,3,2,8,8,4,4,6,0),
                 z=c("M","M","M","F","F","M","M","F","F"))

# Flag rows whose value differs from the previous row; the first row always counts as a change
is.not.eq.with.lag <- function(x) c(TRUE, tail(x, -1) != head(x, -1))

DF[, x1 := is.not.eq.with.lag(x)]
DF[, y1 := is.not.eq.with.lag(y)]
DF[, z1 := is.not.eq.with.lag(z)]
DF

DF[, Index := cumsum(z1 | (x1 & y1))]
DF
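
The same rule can probably be written without the helper columns by using data.table's shift() to get the previous row's value. This is only a sketch: the helper name `changed` is made up for illustration, and it assumes the columns themselves contain no NA, because the NA that shift() puts in the first row is what marks that row as a change.

```r
library(data.table)

DF <- data.table(x = c("a","a","a","b","c","d","e","f","f"),
                 y = c(1,3,2,8,8,4,4,6,0),
                 z = c("M","M","M","F","F","M","M","F","F"))

# Hypothetical helper (not part of the answer): TRUE where a value differs
# from the previous row. shift() pads the first row with NA, which is
# treated as a change. Assumes the data itself has no NA.
changed <- function(v) {
  prev <- shift(v)
  is.na(prev) | v != prev
}

# New group when z changes, or when x and y both change
DF[, Index := cumsum(changed(z) | (changed(x) & changed(y)))]
DF
```

This keeps the table free of the intermediate x1, y1 and z1 columns, at the cost of hiding the individual change flags that the original version lets you inspect.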

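Answer 1 (score: 0)

I know many people warn against for loops in R, but in this case I think a loop is a very direct way to solve the problem. Also, the result does not grow in size, so performance is not a big concern. The for-loop approach is: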

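    dt$grp <- rep(NA,nrow(dt))
    for (i in seq_len(nrow(dt))){
        if (i == 1){
          # the first row always opens group 1
          dt$grp[i] = 1
        }
        else {
          # same group if z is unchanged AND (x OR y is unchanged) relative to the previous row
          if(dt$z[i-1] == dt$z[i] && (dt$x[i-1] == dt$x[i] || dt$y[i-1] == dt$y[i])){
            dt$grp[i] = dt$grp[i-1]
          }else{
            dt$grp[i] = dt$grp[i-1] + 1
          }
        }
    }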

Trying this on the OP's original data gives:

DF = data.frame(x=c("a","a","a","b","c","d","e","f","f"), y=c(1,3,2,8,8,4,4,6,0), z=c("M","M","M","F","F","M","M","F","F"))
dt <- data.table(DF)
dt$grp <- rep(NA,nrow(dt))
for (i in 1:nrow(dt)){
    if (i == 1){
      dt$grp[i] = 1
    }
    else {
      if(dt$z[i-1] == dt$z[i] && (dt$x[i-1] == dt$x[i] || dt$y[i-1] == dt$y[i])){
        dt$grp[i] = dt$grp[i-1]
      }else{
        dt$grp[i] = dt$grp[i-1] + 1
      }
    }
}
dt

   x y z grp
1: a 1 M   1
2: a 3 M   1
3: a 2 M   1
4: b 8 F   2
5: c 8 F   2
6: d 4 M   3
7: e 4 M   3
8: f 6 F   4
9: f 0 F   4

Trying it on the data.table from @Frank's comment also gives the expected result:

dt<-data.table(x = c("b", "a", "a"), y = c(1, 1, 2), z = c("F", "F", "F"))
dt$grp <- rep(NA,nrow(dt))
for (i in 1:nrow(dt)){
    if (i == 1){
      dt$grp[i] = 1
    }
    else {
      if(dt$z[i-1] == dt$z[i] && (dt$x[i-1] == dt$x[i] || dt$y[i-1] == dt$y[i])){
        dt$grp[i] = dt$grp[i-1]
      }else{
        dt$grp[i] = dt$grp[i-1] + 1
      }
    }
}
dt

   x y z grp
1: b 1 F   1
2: a 1 F   1
3: a 2 F   1

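Answer 2 (score: 0)

Edit to add: this solution is in some ways a more verbose version of the one advocated by djhurio above. I think it shows what is going on, so I will leave it.

I think this task is easier if it is broken down a bit. The code below first creates two indices, one for changes in x (nested in z) and another for changes in y (nested in z). We then find the first row of each index. Taking the cumulative sum of the rows where both FIRST.x and FIRST.y are TRUE should give the index you want.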

library(data.table)

dt_example <- data.table(x = c("a","a","a","b","c","d","e","f","f"),
                         y = c(1,3,2,8,8,4,4,6,0),
                         z = c("M","M","M","F","F","M","M","F","F"))

dt_example[,Index_x := .GRP,by = c("z","x")]
dt_example[,Index_y := .GRP,by = c("z","y")]

dt_example[,FIRST.x := !duplicated(Index_x)]
dt_example[,FIRST.y := !duplicated(Index_y)]

dt_example[,Index := cumsum(FIRST.x & FIRST.y)]
dt_example

   x y z Index_x Index_y FIRST.x FIRST.y Index
1: a 1 M       1       1    TRUE    TRUE     1
2: a 3 M       1       2   FALSE    TRUE     1
3: a 2 M       1       3   FALSE    TRUE     1
4: b 8 F       2       4    TRUE    TRUE     2
5: c 8 F       3       4    TRUE   FALSE     2
6: d 4 M       4       5    TRUE    TRUE     3
7: e 4 M       5       5    TRUE   FALSE     3
8: f 6 F       6       6    TRUE    TRUE     4
9: f 0 F       6       7   FALSE    TRUE     4
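
One caveat worth checking: .GRP with by and duplicated() look at the first occurrence of a combination anywhere in the table, while the lag-based answer compares each row only with the one before it. On the sample data the two agree, but they can differ when a value reappears after a gap. The sketch below uses made-up data (not from the question) to show the difference. Which behaviour is wanted depends on the real data, which the question does not specify.

```r
library(data.table)

# Hypothetical data: "a" reappears in x after a "b" row, z is constant
dt <- data.table(x = c("a","b","a"), y = c(1,2,3), z = c("M","M","M"))

# Consecutive-row comparison, as in the lag-based answer
neq_prev <- function(v) c(TRUE, tail(v, -1) != head(v, -1))
dt[, Index_lag := cumsum(neq_prev(z) | (neq_prev(x) & neq_prev(y)))]

# First-occurrence comparison, as in the .GRP / duplicated() code above
dt[, Index_x := .GRP, by = c("z","x")]
dt[, Index_y := .GRP, by = c("z","y")]
dt[, Index_first := cumsum(!duplicated(Index_x) & !duplicated(Index_y))]

# Index_lag is 1, 2, 3; Index_first is 1, 2, 2, because the third row's
# (z, x) pair was already seen in row 1
dt[, .(x, y, z, Index_lag, Index_first)]
```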

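Answer 3 (score: 0)

This method looks for changes in x & z | y & z: a new group starts only when both combinations change. The extra columns are left in the data.table to show the calculation.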

DF[, c("Ix", "Iy", "Iz", "dx", "dy", "min.change", "Index") := 
     #Create index of values based on consecutive order
     list(ix <- rleid(x), iy <- rleid(y), iz <- rleid(z),
          #Determine if combinations of x+z OR y+z change
          ix1 <- c(0, diff(rleid(ix+iz))), 
          iy1 <- c(0, diff(rleid(iy+iz))),
          #Either combination is constant (no change)?
          change <- pmin(ix1, iy1),
          #New index based on change
          cumsum(change) + 1
          )]

   x y z Ix Iy Iz dx dy min.change Index
1: a 1 M  1  1  1  0  0          0     1
2: a 3 M  1  2  1  0  1          0     1
3: a 2 M  1  3  1  0  1          0     1
4: b 8 F  2  4  2  1  1          1     2
5: c 8 F  3  4  2  1  0          0     2
6: d 4 M  4  5  3  1  1          1     3
7: e 4 M  5  5  3  1  0          0     3
8: f 6 F  6  6  4  1  1          1     4
9: f 0 F  6  7  4  0  1          0     4
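
If the intermediate columns are not needed, the same calculation can likely be condensed, because rleid() accepts several vectors and starts a new run whenever any of them changes. The following is a sketch under that assumption, not the answer author's code.

```r
library(data.table)

DF <- data.table(x = c("a","a","a","b","c","d","e","f","f"),
                 y = c(1,3,2,8,8,4,4,6,0),
                 z = c("M","M","M","F","F","M","M","F","F"))

# rleid(x, z) increments whenever x or z changes; likewise rleid(y, z).
# diff() turns each run id into a 0/1 change flag, and pmin() keeps a 1
# only where both the (x, z) and the (y, z) combinations changed.
DF[, Index := cumsum(pmin(c(0, diff(rleid(x, z))),
                          c(0, diff(rleid(y, z))))) + 1]
DF
```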