Question

The dummy dataset is:

data <- data.frame(
  id = c(1,1,2,2,3,4,5,6),
  value = c(10,10,20,20,10,30,40,50),
  other = c(1,2,3,4,5,6,7,8)
)

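The data is the output of dplyr operations in a pipe and is grouped with group_by(id). Each id is associated with at most one value, while two different ids can share the same value. I need to find the cumulative sum across ids, counting each id's value once, by adding a new column: cum_col = c(10,10,30,30,40,70,110,160). cumsum inside mutate computes the cumulative sum over the entire value column and does not pick a single value per group. summarise is no help because I need to keep the other columns intact.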
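To make the problem concrete, here is a small base R sketch of the two obvious attempts. It is only an illustration: `ave()` stands in for what a grouped `mutate(cumsum(value))` would compute, and the names `ungrouped`, `grouped`, and `wanted` are mine. Neither attempt matches the desired column:

```r
data <- data.frame(
  id = c(1,1,2,2,3,4,5,6),
  value = c(10,10,20,20,10,30,40,50),
  other = c(1,2,3,4,5,6,7,8)
)

# Ungrouped cumsum: every row is added, so ids with two rows are counted twice
ungrouped <- cumsum(data$value)

# cumsum within each id (the grouped case): the running total restarts per id
grouped <- ave(data$value, data$id, FUN = cumsum)

# Desired result: each id contributes its value exactly once
wanted <- c(10, 10, 30, 30, 40, 70, 110, 160)

print(data.frame(id = data$id, ungrouped, grouped, wanted))
```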

Is there a way to do this without using summarise and then joining back? Or, if this has been answered before, please point me to the link.

Edit: FYI, the actual data has about 2 million rows and 100 columns.

Answer 1

You can nest the data frame by the id column, compute the cumulative sum, and then unnest:

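library(dplyr); library(tidyr)
# Written for tidyr < 1.0. On newer versions, add ungroup() after nest()
# and use unnest(cols = data); otherwise cumsum() runs within each id.
data %>% 
    group_by(id) %>% nest() %>% 
    mutate(cum_col = cumsum(sapply(data, function(dat) dat$value[1]))) %>% 
    unnest() 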

# A tibble: 8 x 4
#     id cum_col value other
#  <dbl>   <dbl> <dbl> <dbl>
#1     1      10    10     1
#2     1      10    10     2
#3     2      30    20     3
#4     2      30    20     4
#5     3      40    10     5
#6     4      70    30     6
#7     5     110    40     7
#8     6     160    50     8
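For readers on a current tidyr, here is a hedged sketch of the same idea. As far as I know, since tidyr 1.0.0 `nest()` on a grouped data frame keeps the grouping and `unnest()` expects an explicit `cols` argument, so the `ungroup()` call and the `cols = data` argument below are my adjustments, as is the swap from `sapply` to `vapply`. The column order of the result may differ from the output shown above.

```r
library(dplyr)
library(tidyr)

data <- data.frame(
  id = c(1,1,2,2,3,4,5,6),
  value = c(10,10,20,20,10,30,40,50),
  other = c(1,2,3,4,5,6,7,8)
)

res <- data %>%
  group_by(id) %>%
  nest() %>%        # one row per id, remaining columns in list-column `data`
  ungroup() %>%     # drop the grouping so cumsum() runs across ids
  mutate(cum_col = cumsum(vapply(data, function(dat) dat$value[1], numeric(1)))) %>%
  unnest(cols = data)  # expand back to the original rows

print(res)
```

Without the `ungroup()` step each group holds a single nested row, so the running total would never accumulate across ids.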

Comparison with summarise followed by a join:

summarise_f <- function(data) data %>% 
    group_by(id) %>% 
    summarise(val = first(value)) %>%
    mutate(cum_col = cumsum(val)) %>%
    select(-val) %>%
    inner_join(data, by="id")

nest_f <- function(data) data %>% 
    group_by(id) %>% nest() %>% 
    mutate(cum_col = cumsum(sapply(data, function(dat) dat$value[1]))) %>% 
    unnest() 

df <- bind_rows(rep(list(data), 100000))

microbenchmark::microbenchmark(summarise_f(df), nest_f(df))
#Unit: milliseconds
#            expr       min        lq     mean    median        uq      max neval
# summarise_f(df)  79.78891  89.65753 117.8480  93.56766  99.97694 277.3773   100
#      nest_f(df) 191.10597 208.07364 280.2466 225.65567 369.20202 524.5106   100

summarise followed by a join is actually faster, by roughly a factor of two in median time.

With a larger dataset:

df <- bind_rows(rep(list(data), 1000000))
microbenchmark::microbenchmark(summarise_f(df), nest_f(df))
#Unit: milliseconds
#            expr       min        lq      mean    median       uq      max neval
# summarise_f(df)  819.5588  905.2136  993.4916  961.1797 1040.947 1480.391   100
#      nest_f(df) 1768.3060 1992.6753 2069.1454 2057.3091 2162.440 2501.715   100

Answer 2

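Another option is to create a dummy column (cols) that keeps the value only in the first row of each group and replaces the rest with 0, and then take the cumsum over the whole column.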

library(dplyr)
data %>%
  group_by(id) %>%
  mutate(cols = c(value[1], rep(0, n() -1))) %>%
  ungroup() %>%
  mutate(cum_col = cumsum(cols)) %>%
  select(-cols)


# A tibble: 8 x 4
#     id value other cum_col
#  <dbl> <dbl> <dbl>   <dbl>
#1     1    10     1      10
#2     1    10     2      10
#3     2    20     3      30
#4     2    20     4      30
#5     3    10     5      40
#6     4    30     6      70
#7     5    40     7     110
#8     6    50     8     160
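To see why this works, it can help to keep the helper column instead of dropping it. The sketch below is only an illustration: `steps` and `cols_alt` are names I made up, and `cols_alt` shows what I believe is an equivalent way to build the helper with `row_number()` and `if_else()`.

```r
library(dplyr)

data <- data.frame(
  id = c(1,1,2,2,3,4,5,6),
  value = c(10,10,20,20,10,30,40,50),
  other = c(1,2,3,4,5,6,7,8)
)

steps <- data %>%
  group_by(id) %>%
  mutate(
    cols     = c(value[1], rep(0, n() - 1)),        # value in first row, 0 elsewhere
    cols_alt = if_else(row_number() == 1, value, 0) # same helper, written differently
  ) %>%
  ungroup() %>%
  mutate(cum_col = cumsum(cols))  # repeated rows add 0, so the total carries forward

print(steps)
```

Since the helper holds each id's value exactly once, a plain cumsum over it yields the per-id running total on every row.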

Answer 3

We can also use duplicated, which zeroes out every row of an id after its first occurrence:

library(dplyr)
data %>%
     mutate(cum_col = cumsum(value*!duplicated(id)))
#  id value other cum_col
#1  1    10     1      10
#2  1    10     2      10
#3  2    20     3      30
#4  2    20     4      30
#5  3    10     5      40
#6  4    30     6      70
#7  5    40     7     110
#8  6    50     8     160
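One caveat worth noting: this one-liner adds each id's value at the first row of that id, so it implicitly assumes all rows of an id sit next to each other, as they do in the example data. If that is not guaranteed, a base R sketch that looks up the running total per id might look like this (`cum_by_first` is a hypothetical helper name, not something from the answer):

```r
# Running total over first occurrences, then mapped back to every row by id
cum_by_first <- function(id, value) {
  first  <- !duplicated(id)        # TRUE on the first row of each id
  totals <- cumsum(value[first])   # one running total per distinct id
  totals[match(id, id[first])]     # look up each row's total by its id
}

data <- data.frame(
  id = c(1,1,2,2,3,4,5,6),
  value = c(10,10,20,20,10,30,40,50),
  other = c(1,2,3,4,5,6,7,8)
)
data$cum_col <- cum_by_first(data$id, data$value)
print(data)

# Rows of id 1 are not adjacent here
scattered_id    <- c(1, 2, 1)
scattered_value <- c(10, 20, 10)
naive  <- cumsum(scattered_value * !duplicated(scattered_id))  # 10 30 30
lookup <- cum_by_first(scattered_id, scattered_value)          # 10 30 10
```

On the example data both approaches agree; they only diverge when an id reappears after other ids.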

Get the cumsum of unique values using dplyr mutate
