Calculating cumulative sums of multiple columns in R

Time: 2020-12-29 18:40:30

Tags: r

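R newbie here. I am trying to calculate cumulative sums grouped by year, month, group, and subgroup, and there are multiple columns to accumulate.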

Sample data:

df <- data.frame("Year"=2020,
                "Month"=c("Jan","Jan","Jan","Jan","Feb","Feb","Feb","Feb"),
                "Group"=c("A","A","A","B","A","B","B","B"),
                "SubGroup"=c("a","a","b","b","a","b","a","b"),
                "V1"=c(10,10,20,20,50,50,10,10),
                "V2"=c(0,1,2,2,0,5,1,1))
    
       Year Month Group SubGroup V1 V2
    1 2020   Jan     A        a 10  0
    2 2020   Jan     A        a 10  1
    3 2020   Jan     A        b 20  2
    4 2020   Jan     B        b 20  2
    5 2020   Feb     A        a 50  0
    6 2020   Feb     B        b 50  5
    7 2020   Feb     B        a 10  1
    8 2020   Feb     B        b 10  1

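From the sample table: for January 2020, Group "A" and SubGroup "a" sum to 10 + 10 = 20. For February 2020 the value is 50, so the cumulative total is 20 (from Jan) + 50 = 70, and so on.

If a combination has no value, treat it as 0.

I have tried several pieces of code, but none came close to the output I need. I would appreciate any help with this.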

3 Answers:

Answer 0: (score: 1)

This is a simple group_by/mutate problem. Select columns V1 and V2 with across and apply cumsum to each.

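library(dplyr)

# Make Month a factor so that it sorts Jan < Feb instead of alphabetically
df$Month <- factor(df$Month, levels = c("Jan", "Feb"))

# cumsum runs in row order, so rows must already be chronological within each group
df %>%
  group_by(Year, Group, SubGroup) %>%
  mutate(across(V1:V2, ~cumsum(.x))) %>%
  ungroup() %>%
  arrange(Year, Group, SubGroup, Month)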
## A tibble: 8 x 6
#  Year  Month Group SubGroup    V1    V2
#  <chr> <fct> <chr> <chr>    <dbl> <dbl>
#1 2020  Jan   A     a           10     0
#2 2020  Jan   A     a           20     1
#3 2020  Feb   A     a           70     1
#4 2020  Jan   A     b           20     2
#5 2020  Feb   B     a           10     1
#6 2020  Jan   B     b           20     2
#7 2020  Feb   B     b           70     7
#8 2020  Feb   B     b           80     8
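
The code above accumulates in row order, so it depends on the input already being sorted chronologically within each group. Here is a minimal sketch of a more defensive variant, assuming the same df and dplyr 1.0 or later: sort by Year and Month before grouping. The reversed copy `shuffled` is a hypothetical stand-in for unsorted input, not part of the original question.

```r
library(dplyr)

df <- data.frame(
  Year     = 2020,
  Month    = c("Jan", "Jan", "Jan", "Jan", "Feb", "Feb", "Feb", "Feb"),
  Group    = c("A", "A", "A", "B", "A", "B", "B", "B"),
  SubGroup = c("a", "a", "b", "b", "a", "b", "a", "b"),
  V1       = c(10, 10, 20, 20, 50, 50, 10, 10),
  V2       = c(0, 1, 2, 2, 0, 5, 1, 1)
)
df$Month <- factor(df$Month, levels = c("Jan", "Feb"))

# Hypothetical unsorted input: the same rows in reverse order
shuffled <- df[rev(seq_len(nrow(df))), ]

res <- shuffled %>%
  arrange(Year, Month) %>%                 # put rows in chronological order first
  group_by(Year, Group, SubGroup) %>%
  mutate(across(c(V1, V2), cumsum)) %>%    # running totals within each group
  ungroup() %>%
  arrange(Year, Group, SubGroup, Month)

print(res)
```

Rows within the same month can still appear in either order, so the intermediate running totals inside a month may differ from the output above. The total at the end of each month does not.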

Answer 1: (score: 0)

library(dplyr)
library(zoo)

df %>%
  arrange(as.yearmon(paste0(Year, '-', Month), '%Y-%b'), Group, SubGroup) %>%
  group_by(Year, Group, SubGroup) %>% 
  mutate(
         V1 = cumsum(V1),
         V2 = cumsum(V2)
       ) %>% 
  arrange(Year, Group, SubGroup, as.yearmon(paste0(Year, '-', Month), '%Y-%b')) #for desired output ordering

#  A tibble: 8 x 6
#  Groups:   Year, Group, SubGroup [4]
#   Year  Month Group SubGroup    V1    V2
#   <chr> <chr> <chr> <chr>    <dbl> <dbl>
# 1 2020  Jan   A     a           10     0
# 2 2020  Jan   A     a           20     1
# 3 2020  Feb   A     a           70     1
# 4 2020  Jan   A     b           20     2
# 5 2020  Feb   B     a           10     1
# 6 2020  Jan   B     b           20     2
# 7 2020  Feb   B     b           70     7
# 8 2020  Feb   B     b           80     8
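
If adding zoo is not desirable, the same idea can be sketched in base R alone. This is an illustration, not the answerer's method: it assumes Month holds English three-letter abbreviations, so the built-in constant month.abb can map them to month numbers. The helper column MonthNum and the result name `out` are introduced here for illustration.

```r
df <- data.frame(
  Year     = 2020,
  Month    = c("Jan", "Jan", "Jan", "Jan", "Feb", "Feb", "Feb", "Feb"),
  Group    = c("A", "A", "A", "B", "A", "B", "B", "B"),
  SubGroup = c("a", "a", "b", "b", "a", "b", "a", "b"),
  V1       = c(10, 10, 20, 20, 50, 50, 10, 10),
  V2       = c(0, 1, 2, 2, 0, 5, 1, 1)
)

# Map "Jan", "Feb", ... to 1, 2, ... (assumes English abbreviations)
df$MonthNum <- match(df$Month, month.abb)

# Sort chronologically so cumsum accumulates in calendar order
df <- df[order(df$Year, df$MonthNum), ]

# Running total within each Year/Group/SubGroup, one column at a time
for (col in c("V1", "V2")) {
  df[[col]] <- ave(df[[col]], df$Year, df$Group, df$SubGroup, FUN = cumsum)
}

# Reorder for display and drop the helper column
out <- df[order(df$Year, df$Group, df$SubGroup, df$MonthNum),
          c("Year", "Month", "Group", "SubGroup", "V1", "V2")]
print(out)
```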

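Answer 2: (score: 0)

If I understand what you are after, you want to compute the sum for each month and then the cumulative sum of those monthly sums. This is usually easy in dplyr.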

library(dplyr)

df %>% 
  group_by(Year, Month, Group, SubGroup) %>% 
  summarize(
    V1_sum = sum(V1),
    V2_sum = sum(V2)
  ) %>% 
  group_by(Year, Group, SubGroup) %>% 
  mutate(
    V1_cumsum = cumsum(V1_sum),
    V2_cumsum = cumsum(V2_sum)
  )


# A tibble: 6 x 8
# Groups:   Year, Group, SubGroup [4]
#   Year Month Group SubGroup V1_sum V2_sum V1_cumsum V2_cumsum
#   <dbl> <chr> <chr> <chr>     <dbl>  <dbl>     <dbl>     <dbl>
# 1  2020 Feb   A     a            50      0        50         0
# 2  2020 Feb   B     a            10      1        10         1
# 3  2020 Feb   B     b            60      6        60         6
# 4  2020 Jan   A     a            20      1        70         1
# 5  2020 Jan   A     b            20      2        20         2
# 6  2020 Jan   B     b            20      2        80         8

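But you will notice that the monthly cumulative sums run backwards (January comes after February), because group_by sorts character grouping values alphabetically by default. You also do not see the empty combinations, because dplyr does not fill them in.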
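
The ordering issue is ordinary character sorting, and can be sketched in base R. The vector names below are illustrative only; month.abb is R's built-in vector of English month abbreviations.

```r
months <- c("Jan", "Feb", "Mar", "Apr")

# Character vectors sort alphabetically, which scrambles calendar order
alpha_order <- sort(months)

# A factor sorts by its levels, so calendar order can be imposed explicitly
months_f <- factor(months, levels = month.abb)
chrono_order <- as.character(sort(months_f))

print(alpha_order)
print(chrono_order)
```

Grouping or arranging by the factor version keeps January ahead of February.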

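To fix the month order, you can make the months numeric (convert them to dates) or convert them to factors. You can add back the "missing" combinations of the grouping variables by using aggregate from base R instead of dplyr::summarize. With drop = FALSE, aggregate includes every combination of the grouping factors. It returns NA for combinations that have no data, but you can replace NA with 0 using, for example, tidyr::replace_na.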

library(dplyr)
library(tidyr)

df <- data.frame("Year"=2020,
                 "Month"=c("Jan","Jan","Jan","Jan","Feb","Feb","Feb","Feb"),
                 "Group"=c("A","A","A","B","A","B","B","B"),
                 "SubGroup"=c("a","a","b","b","a","b","a","b"),
                 "V1"=c(10,10,20,20,50,50,10,10),
                 "V2"=c(0,1,2,2,0,5,1,1))

df$Month <- factor(df$Month, levels = c("Jan", "Feb"), ordered = TRUE)

# Get monthly sums
df1 <- with(df, aggregate(
  list(V1_sum = V1, V2_sum = V2),
  list(Year = Year, Month = Month, Group = Group, SubGroup = SubGroup),
  FUN = sum, drop = FALSE
))

df1 <- df1 %>% 
  # Replace NA with 0
  mutate(
    V1_sum = replace_na(V1_sum, 0),
    V2_sum = replace_na(V2_sum, 0)
  ) %>% 
  # Get cumulative sum across months
  group_by(Year, Group, SubGroup) %>% 
  mutate(V1cumsum = cumsum(V1_sum), 
         V2cumsum = cumsum(V2_sum)) %>%
  ungroup() %>% 
  select(Year, Month, Group, SubGroup, V1 = V1cumsum, V2 = V2cumsum)

This gives the same result as your example:

# # A tibble: 8 x 6
#    Year Month Group SubGroup    V1    V2
#    <dbl> <ord> <chr> <chr>    <dbl> <dbl>
# 1  2020 Jan   A     a           20     1
# 2  2020 Feb   A     a           70     1
# 3  2020 Jan   B     a            0     0
# 4  2020 Feb   B     a           10     1
# 5  2020 Jan   A     b           20     2
# 6  2020 Feb   A     b           20     2
# 7  2020 Jan   B     b           20     2
# 8  2020 Feb   B     b           80     8
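
To stay within the tidyverse, one possible alternative to aggregate is tidyr::complete, which adds the missing combinations after summarizing. This is a sketch under assumptions, not a drop-in replacement: it assumes dplyr 1.0 or later (for across and the .groups argument) and fills the added rows with 0 through complete's fill argument.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  Year     = 2020,
  Month    = c("Jan", "Jan", "Jan", "Jan", "Feb", "Feb", "Feb", "Feb"),
  Group    = c("A", "A", "A", "B", "A", "B", "B", "B"),
  SubGroup = c("a", "a", "b", "b", "a", "b", "a", "b"),
  V1       = c(10, 10, 20, 20, 50, 50, 10, 10),
  V2       = c(0, 1, 2, 2, 0, 5, 1, 1)
)
df$Month <- factor(df$Month, levels = c("Jan", "Feb"))

res <- df %>%
  # Monthly sums; drop grouping so complete() sees the whole table
  group_by(Year, Month, Group, SubGroup) %>%
  summarize(V1 = sum(V1), V2 = sum(V2), .groups = "drop") %>%
  # Add every Year/Month/Group/SubGroup combination, filling gaps with 0
  complete(Year, Month, Group, SubGroup, fill = list(V1 = 0, V2 = 0)) %>%
  # Accumulate across months in calendar order
  arrange(Year, Month) %>%
  group_by(Year, Group, SubGroup) %>%
  mutate(across(c(V1, V2), cumsum)) %>%
  ungroup()

print(res)
```

The row order may differ from the aggregate version, but each Year/Month/Group/SubGroup combination should carry the same cumulative values.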