Cumulative sum over multiple variables in dplyr

Time: 2019-05-30 20:47:33

Tags: r dplyr

My data looks like this:

library(tidyverse)
Date <- c(rep("5/22/19", 3), rep("5/23/19", 3), rep("5/24/19", 3))
Source <- rep(c("Control", "A", "B"), 3)
ValueA <- c(12080, 12012, 11944, 13345, 13342, 13422, 16226, 16045, 16221)
ValueB <- c(11, 9, 13, 11, 9, 7, 12, 9, 15)
df <- tibble(Date, Source, ValueA, ValueB)

df
# A tibble: 9 x 4
  Date    Source  ValueA ValueB
  <chr>   <chr>    <dbl>  <dbl>
1 5/22/19 Control  12080     11
2 5/22/19 A        12012      9
3 5/22/19 B        11944     13
4 5/23/19 Control  13345     11
5 5/23/19 A        13342      9
6 5/23/19 B        13422      7
7 5/24/19 Control  16226     12
8 5/24/19 A        16045      9
9 5/24/19 B        16221     15

What I want is the cumulative sum by Date and Source, so the output would look like this:

  Date    Source  ValueA ValueB
1 5/22/19 Control  12080     11
2 5/22/19 A        12012      9
3 5/22/19 B        11944     13
4 5/23/19 Control  25425     22
5 5/23/19 A        25354     18
6 5/23/19 B        25366     20
7 5/24/19 Control  41651     34
8 5/24/19 A        41399     27
9 5/24/19 B        41587     35

However, when I use this code:

df <- df %>%
  group_by(Date, Source) %>%
  summarize(
    ValueA = sum(ValueA, na.rm = TRUE),
    ValueB = sum(ValueB, na.rm = TRUE),
    Cum_A = cumsum(ValueA, na.rm = TRUE),
    Cum_B = cumsum(ValueB, na.rm = TRUE)
  )

I get this error:

Error in cumsum(ValueA, na.rm = TRUE) : 
  2 arguments passed to 'cumsum' which requires 1

I assume the cumsum function isn't designed to handle multiple grouping variables. So how can I get the desired result?

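2 answers:

Answer 0: (score: 1)

I think you just need group_by(Source). See if this meets your needs.

A few notes:

  • Keeping Source in arrange() is optional; removing it recreates the data you asked for above. I included it so the results of cumsum() are easier to see
  • Given your current dataset (no repeated Source/Date pairs), there is no need to summarize; mutate() does the job
  • cumsum() does not accept an na.rm argument, but you can use replace_na() to replace NA values with 0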
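
The last note can be checked on a small vector. This is a minimal sketch; the vector `x` is made up for illustration and is not part of the question's data. It shows that a single NA poisons every later element of the running total unless it is replaced first.

```r
library(tidyr)  # provides replace_na()

# Hypothetical input with one missing value in the middle
x <- c(11, NA, 12)

# cumsum() has no na.rm argument; the NA propagates to all later elements
raw_total <- cumsum(x)
raw_total
# 11 NA NA

# Treating the NA as 0 keeps the running total going
filled_total <- cumsum(replace_na(x, 0))
filled_total
# 11 11 23
```

Replacing NA with 0 is a choice, not a requirement: it makes the missing row contribute nothing to the total, which may or may not be what the data means.
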
df <- 
  tibble(
    Date = rep(c("5/22/19", "5/23/19", "5/24/19"), each = 3),
    Source = rep(c("Control", "A", "B"), 3), 
    ValueA = c(12080, 12012, 11944, 13345, 13342, 13422, 16226, 16045, 16221), 
    ValueB = c(11, 9, 13, NA, 9, 7, 12, 9, 15)
  )


df %>%  
  arrange(Source, Date) %>% 
  group_by(Source) %>%
  mutate(
    Cum_A = cumsum(replace_na(ValueA, 0)),
    Cum_B = cumsum(replace_na(ValueB, 0))
  ) %>% 
  ungroup()

# Date    Source  ValueA ValueB Cum_A Cum_B
# 5/22/19 A        12012      9 12012     9
# 5/23/19 A        13342      9 25354    18
# 5/24/19 A        16045      9 41399    27
# -----------------------------------------
# 5/22/19 B        11944     13 11944    13
# 5/23/19 B        13422      7 25366    20
# 5/24/19 B        16221     15 41587    35
# -----------------------------------------
# 5/22/19 Control  12080     11 12080    11
# 5/23/19 Control  13345     NA 25425    11
# 5/24/19 Control  16226     12 41651    23
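
With more than two value columns, writing one cumsum() line per column gets repetitive. The sketch below is one possible variation on the code above, assuming dplyr 1.0 or later (where across() is available). The generated column names Cum_ValueA and Cum_ValueB come from the `.names` pattern chosen here, not from the original answer.

```r
library(dplyr)
library(tidyr)

# Same data as in the answer above, including the NA in ValueB
df <- tibble(
  Date = rep(c("5/22/19", "5/23/19", "5/24/19"), each = 3),
  Source = rep(c("Control", "A", "B"), 3),
  ValueA = c(12080, 12012, 11944, 13345, 13342, 13422, 16226, 16045, 16221),
  ValueB = c(11, 9, 13, NA, 9, 7, 12, 9, 15)
)

cum_df <- df %>%
  arrange(Source, Date) %>%
  group_by(Source) %>%
  # Apply the same NA-safe running total to every listed column;
  # "Cum_{.col}" names the new columns Cum_ValueA and Cum_ValueB
  mutate(across(
    c(ValueA, ValueB),
    ~ cumsum(replace_na(.x, 0)),
    .names = "Cum_{.col}"
  )) %>%
  ungroup()

cum_df
```

If a Source/Date pair could appear more than once, a summarise() step that totals each pair would presumably need to run before the mutate().
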

Answer 1: (score: 0)

Try using tally() (from dplyr) instead of cumsum inside summarize:

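df = df %>% 
  # group_by_() is deprecated; group_by() with bare column names is the current form
  group_by(Date, Source, ValueA, ValueB) %>%
  # tally() counts the rows in each group and stores the count in column n
  tally() %>% 
  # drop the count column again
  select(-n)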

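This approach sorts the output in ascending order of the Source variable, but from there it should be fairly simple to sort the data into whatever order you prefer.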
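
Because a running total depends on row order, sorting matters here. Date is stored as text in the question, so arrange() would order it alphabetically, which only matches calendar order while all dates share the same month and year. The sketch below is one way to sort chronologically first. The three rows are made up for illustration, and the "%m/%d/%y" format is an assumption based on how the dates in the question look.

```r
library(dplyr)

# Hypothetical rows whose text order differs from their calendar order
df <- tibble(
  Date = c("12/31/19", "1/1/20", "5/24/19"),
  Source = "A",
  ValueA = c(2, 3, 1)
)

df_sorted <- df %>%
  # Parse the text into real Date values so sorting is chronological
  mutate(Date = as.Date(Date, format = "%m/%d/%y")) %>%
  arrange(Source, Date) %>%
  group_by(Source) %>%
  mutate(Cum_A = cumsum(ValueA)) %>%
  ungroup()

df_sorted
# Rows come out as 2019-05-24, 2019-12-31, 2020-01-01 with Cum_A 1, 3, 6
```
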