具有多条件复位的R / dplyr累积和

时间:2018-03-02 23:49:02

标签: r dplyr

dplyr / R cumulative sum with reset类似,我想计算一系列('a')的组和cumsum,其复位基于cumsum超过某个阈值或当“当前”观察之间的某个其他阈值并且自复位以来的第一次观察得到满足或超过。处理这些条件中的任何一个都是关键。

例如:

library(dplyr)
library(tibble)
library(purrr)

tib <- tibble(
  t = c(1,2.05,3,3.5,4.5,4.75,7, 7.3),
  a = c(1,1,1,1,2,9,3,1)
)

# what I want
## thresh_a = 4
## thresh_t = 9
# A tibble: 6 x 4
#         t     a     g     c
#      <dbl> <dbl> <int> <dbl>
#   1  1.00  1.00     0  1.00
#   2  2.05  1.00     0  2.00
#   3  3.00  1.00     0  3.00
#   4  3.50  1.00     0  4.00
#   5  4.50  2.00     1  2.00
#   6  4.75  9.00     1 11.00
#   7  7.00  3.00     2  3.00
#   8  7.30  1.00     2  4.00

# what I want
## thresh_a = 4
## thresh_t = 2
# A tibble: 6 x 4
#         t     a     g     c
#      <dbl> <dbl> <int> <dbl>
#   1  1.00  1.00     0  1.00
#   2  2.05  1.00     0  2.00
#   3  3.00  1.00     0  3.00
#   4  3.50  1.00     1  1.00
#   5  4.50  2.00     1  3.00
#   6  4.75  9.00     1 12.00
#   7  7.00  3.00     2  3.00
#   8  7.30  1.00     3  1.00

在第一种情况下,当'a'上的cumsum超过thresh_a时,分组变量('g')会发生变化。在第二种情况下,复位发生在t = 3,因为3-1> = thresh_t。复位发生在t = 4.75,因为累积到超过thresh_a。由于7-4.75&gt; t = 7重置。 thresh_t。实际上,'t'是时间戳。

编辑: 循环版本。寻求更加矢量化,整洁的版本:

sum_cond_reset <- function(thresh_a, thresh_t, a, t) {
  if (length(a) != length(t))
  {
    stop("length of vectors must be equal")
  }

  cumsum <- 0
  grp <- 0
  grp_idx <- 1
  result_cumsum <- numeric()
  result_grp <- numeric()
  for (i in 1:length(a)) {
    cumsum <- cumsum + a[i]
    delta_t <- 0
    if (i > 1) {
      delta_t = t[i] - t[grp_idx]
    }
    if (cumsum >= thresh_a | delta_t >= thresh_t) {
      result_grp <- c(result_grp, grp)
      result_cumsum <- c(result_cumsum, cumsum)
      grp <- grp + 1
      grp_idx <- i
      cumsum <- 0
    } else {
      result_cumsum <- c(result_cumsum, cumsum)
      result_grp <- c(result_grp, grp)
    }

  }
  return(tibble(g = result_grp,
                c = result_cumsum))
}

tib <- tibble(
  t = c(1,2.05,3,3.5,4.5,4.75,7, 7.3),
  a = c(1,1,1,1,2,9,3,1)
)

bind_cols(tib, sum_cond_reset(4,9, tib$a, tib$t) )

bind_cols(tib, sum_cond_reset(4,2, tib$a, tib$t) )

生成

> bind_cols(tib, sum_cond_reset(4,9, tib$a, tib$t) )
# A tibble: 8 x 4
      t     a     g     c
  <dbl> <dbl> <dbl> <dbl>
1  1.00  1.00  0     1.00
2  2.05  1.00  0     2.00
3  3.00  1.00  0     3.00
4  3.50  1.00  0     4.00
5  4.50  2.00  1.00  2.00
6  4.75  9.00  1.00 11.0 
7  7.00  3.00  2.00  3.00
8  7.30  1.00  2.00  4.00
> bind_cols(tib, sum_cond_reset(4,2, tib$a, tib$t) )
# A tibble: 8 x 4
      t     a     g     c
  <dbl> <dbl> <dbl> <dbl>
1  1.00  1.00  0     1.00
2  2.05  1.00  0     2.00
3  3.00  1.00  0     3.00
4  3.50  1.00  1.00  1.00
5  4.50  2.00  1.00  3.00
6  4.75  9.00  1.00 12.0 
7  7.00  3.00  2.00  3.00
8  7.30  1.00  3.00  1.00

0 个答案:

没有答案