Find the cumulative sum of one column until a conditional sum on another column is met

Time: 2017-07-01 19:12:13

Tags: r cumsum

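For each row, I want the preceding cumulative sum of column B (that is, the cumsum excluding the current row), taken over the preceding rows for which the sum of column A, including the current row, is still <= 7.

I was able to get the answer with a traditional for loop. A vectorized implementation would be very useful, because I need to run this on a large data set. I am sharing my simple code in case it helps.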

dt <- data.frame(A = c(0, 2, 3, 5, 8, 90, 8, 2, 4, 1, 2),
                 B = c(1, 0, 4, 2, 3, 4, 2, 1, 2, 3, 1),
                 Ans = c(0, 1, 1, 4, 0, 0, 0, 2, 3, 5, 6),
                 new=rep(0,11))



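dt3 <- dt
for (i in 2:nrow(dt3)) {
  set <- 0    # running sum of B over the preceding rows
  count <- 0  # running sum of A, starting at the current row
  k <- i - 1
  # walk backwards from the row just above row i
  for (j in k:1) {
    count <- count + dt3$A[j + 1]
    if (count <= 7) {
      set <- set + dt3$B[j]
      if (j == 1) {
        dt3$new[i] <- set  # reached the first row without exceeding 7
      }
    } else {
      dt3$new[i] <- set    # sum of A exceeded 7: keep what was collected
    }
  }
}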

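These are the 3 conditions to be met:

  1. If A > 7, then Ans resets to 0
  2. If cumsum(A) <= 7, then Ans is the cumsum() of lagged B
  3. If cumsum(A) > 7, then Ans is the cumsum() of lagged B over the range of preceding rows of A whose sum is <= 7

Here is a simplified version of the data (columns A and B); the desired output is column Ans: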

    dt <- data.frame(A = c(0, 2, 3, 5, 8, 90, 8, 2, 4, 1, 2),
                     B = c(1, 0, 4, 2, 3, 4, 2, 1, 2, 3, 1),
                     Ans = c(0, 1, 1, 4, 0, 0, 0, 2, 3, 5, 6))
    
    dt
        A B Ans   Reason for value in Ans:
    1   0 1   0       There are no preceding rows in B so Ans is 0
    2   2 0   1       Sum of value of A from row 2 to 1 is 2 <=7. So Ans is the value of B from first row = 1
    3   3 4   1       Sum of value of A from row 3,2 and 1 is 5 <=7. So Ans is the sum of value of B in row 1 and 2, which is 1. 
    4   5 2   4       Value of A from row 4 is 5 which is <=7. So Ans is value of B from row 3, which is 4
    5   8 3   0       Value of A in row 5 is 8 which is >7. So Ans is 0 (Value of Ans resets to 0 when A > 7).
    6  90 4   0
    7   8 2   0
    8   2 1   2        Value of A in row 8 is 2 which is <=7, so Ans is value of B in row 7 which is 2
    9   4 2   3        Sum of value of A from row 9 and 8 is 6<=7, so Ans is sum of value of B in row 8 and 7 = 3
    10  1 3   5        Sum of value of A from row 10,9 and 8 is 7<=7, so Ans is sum of value of B in row 9,8 and 7 =5.
    11  2 1   6        Sum of value of A from row 11,10 and 9 is 7<=7, so Ans is sum of value of B in row 10,9 and 8 =6. 
    

Any help on how to code this in R?

1 answer:

Answer 0: (score: 2)

Please see the edit below, which attempts to answer the updated question.

If I understand the OP's intention correctly, there are 3 rules:

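  1. If A is greater than 7, then Ans is zero and the grouping restarts
  2. If cumsum(A) within the group is less than or equal to 7, then Ans is the cumsum() of lagged B
  3. If cumsum(A) within the group is greater than 7, then Ans is lagged B

The following code produces the expected result for the given sample data set:
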
    # create sample data set
    DF <- data.frame(A = c(0, 2, 3, 5, 8, 90, 8, 2, 4, 1),
                     B = c(1, 0, 4, 2, 3, 4, 2, 1, 2, 3),
                     Ans = c(0, 1, 1, 4, 0, 0, 0, 2, 3, 5))
    # load data.table, CRAN version 1.10.4 used
    library(data.table)
    # coerce to data.table
    DT <- data.table(DF)
    # create helper column with lagged values of B
    DT[, lagB := shift(B, fill = 0)][]
    # create new answer
    DT[, new := (A <= 7) * ifelse(cumsum(A) <= 7, cumsum(lagB), lagB), by = rleid(A <= 7)][
      , lagB := NULL][]
    

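         A B Ans new
     1:  0 1   0   0
     2:  2 0   1   1
     3:  3 4   1   1
     4:  5 2   4   4
     5:  8 3   0   0
     6: 90 4   0   0
     7:  8 2   0   0
     8:  2 1   2   2
     9:  4 2   3   3
    10:  1 3   5   5

rleid(A <= 7) creates a unique group number for every consecutive streak of values that are either all not greater than 7 or all greater than 7. The ifelse() clause implements rules 2 and 3 within each group. Multiplying the result by (A <= 7) implements rule 1, using the trick that as.numeric(TRUE) is 1 and as.numeric(FALSE) is 0. Finally, the helper column is removed.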
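
To look at the grouping and masking steps in isolation, here is a minimal base R sketch. It is not the answer's code: rle() and ave() stand in for rleid() and the by = grouping, and the names grp and in_group are made up for illustration.

```r
# Base R walk-through of the grouping and masking steps (10-row sample data)
A <- c(0, 2, 3, 5, 8, 90, 8, 2, 4, 1)
B <- c(1, 0, 4, 2, 3, 4, 2, 1, 2, 3)

# Lagged B: each row sees the B value of the row above, 0 for the first row
lagB <- c(0, head(B, -1))

# One group id per consecutive run of TRUE or FALSE in (A <= 7);
# this should give the same numbering as data.table's rleid(A <= 7)
runs <- rle(A <= 7)
grp <- rep(seq_along(runs$lengths), runs$lengths)
grp
# [1] 1 1 1 1 2 2 2 3 3 3

# Rules 2 and 3 inside each group; ave() applies cumsum() group by group
in_group <- ifelse(ave(A, grp, FUN = cumsum) <= 7,
                   ave(lagB, grp, FUN = cumsum),
                   lagB)

# Rule 1: multiplying by a logical treats TRUE as 1 and FALSE as 0
new <- (A <= 7) * in_group
new
# [1] 0 1 1 4 0 0 0 2 3 5
```

Rows 5 to 7 form their own group because A > 7 there, and the multiplication by (A <= 7) is what turns their values into 0.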

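Edit

Based on the additional information supplied by the OP, I believe only one rule is left:

For every row
    • Find a window extending backwards that contains as many rows as possible while sum(A) does not exceed 7. The answer is the sum of lagged B within the same window.
    • For clarification: if the window has length zero because A in the starting row already exceeds 7, the answer is zero.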
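
The single rule can be written down directly as an unvectorized reference, which is handy for checking faster versions. This is only a sketch: window_answer and its limit argument are hypothetical names, not part of the original answer.

```r
# Hypothetical helper: a direct, row-by-row restatement of the rule
window_answer <- function(A, B, limit = 7) {
  lagB <- c(0, head(B, -1))   # lagged B, 0 for the first row
  out <- numeric(length(A))
  for (i in seq_along(A)) {
    total <- 0                # sum of A inside the current window
    j <- i
    # extend the window backwards while the sum of A stays within the limit
    while (j >= 1 && total + A[j] <= limit) {
      total <- total + A[j]
      out[i] <- out[i] + lagB[j]
      j <- j - 1
    }
  }
  out
}

A <- c(0, 2, 3, 5, 8, 90, 8, 2, 4, 1, 2)
B <- c(1, 0, 4, 2, 3, 4, 2, 1, 2, 3, 1)
window_answer(A, B)
# [1] 0 1 1 4 0 0 0 2 3 5 6
```

When A in the starting row already exceeds the limit, the while loop never runs and the result stays at 0, which matches the zero-length window case.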

The variable length of the sliding window is the tricky part:

    # sample data set consists of 11 rows after OP's edit
    DF <- data.frame(A = c(0, 2, 3, 5, 8, 90, 8, 2, 4, 1, 2),
                     B = c(1, 0, 4, 2, 3, 4, 2, 1, 2, 3, 1),
                     Ans = c(0, 1, 1, 4, 0, 0, 0, 2, 3, 5, 6))
    DT <- data.table(DF) 
    DT[, lagB := shift(B, fill = 0)][]
    
    # find window lengths
    DT[, wl := DT[, Reduce(`+`, shift(A, 0:6, fill = 0), accumulate = TRUE)][, rn := .I][
      , Position(function(x) x <= 7, right = TRUE, unlist(.SD)), by = rn]$V1][]
    
    # sum lagged B in respective window
    DT[, new := DT[, Reduce(`+`, shift(lagB, 0:6, fill = 0), accumulate = TRUE)][
      , rn := .I][, wl := DT$wl][, ifelse(is.na(wl), 0, unlist(.SD)[wl]), by = rn]$V1][]
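
Note that shift(A, 0:6, fill = 0) looks back at most 7 rows, so the code above cannot find longer windows (for example, across a long run of zeros in A). As an alternative that is not part of the original answer, the window start can be looked up on the cumulative sum with findInterval(). This sketch assumes every value of A is non-negative, so that cumsum(A) is sorted; window_answer_vec is a hypothetical name, and for non-integer A floating-point rounding at the boundary may need extra care.

```r
# Hypothetical helper; assumes all A >= 0 so that cumsum(A) is non-decreasing
window_answer_vec <- function(A, B, limit = 7) {
  n <- length(A)
  csA <- cumsum(A)
  cs0 <- c(0, csA)        # cs0[j] = sum of A over rows 1..(j - 1)
  cB0 <- c(0, cumsum(B))  # cB0[k + 1] = sum of B over rows 1..k
  # first row of each window: the smallest j with sum(A[j:i]) <= limit;
  # left.open = TRUE counts the elements of cs0 strictly below csA - limit
  start <- findInterval(csA - limit, cs0, left.open = TRUE) + 1L
  # lagged B over rows start..i equals B over rows (start - 1)..(i - 1);
  # an empty window (start == i + 1) yields cB0[i] - cB0[i] = 0
  cB0[seq_len(n)] - cB0[pmax(start - 1L, 1L)]
}

A <- c(0, 2, 3, 5, 8, 90, 8, 2, 4, 1, 2)
B <- c(1, 0, 4, 2, 3, 4, 2, 1, 2, 3, 1)
window_answer_vec(A, B)
# [1] 0 1 1 4 0 0 0 2 3 5 6
```

The design choice here is to trade the per-row window search for two cumulative sums and one binary search per row, which should scale better on large data sets than a nested loop.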