Calculating the proportion of a variable when grouping by multiple variables using dplyr

Date: 2017-01-03 12:00:55

Tags: r dplyr grouping

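I have a tibble with the columns group, account and duration, where each row represents one event. I want to make a nice summary table that includes the group, the account, the total duration, a calculated price and, finally, each group's proportion of the overall total.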

Reproducible sample:

library(tidyverse)
library(lubridate)
tidy_data <- structure(list(group = c("Group 1", "Group 2", "Group 3", "Group 1", "Group 2", "Group 3", "Group 4", "Group 4", "Group 2"), account = c("Account 1", "Account 2","Account 3", "Account 1", "Account 2", "Account 3", "Account 4", "Account 4", "Account 2"), duration = structure(c(146.15, 181.416666666667, 96.9, 52.2833333333333, 99.4333333333333, 334.116666666667, 16.6333333333333, 11.5666666666667, 79.5666666666667), units = "mins", class = "difftime")), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -9L), .Names = c("group","account", "duration"))
hourPrice = 25

Summary 1 - calculates the proportion correctly, but does not include the account

tidy_data %>% 
    group_by(group) %>%
    summarise(total = sum(duration) %>% time_length(unit = "hour") %>% round(digits = 2),
                        price = (total*hourPrice) %>% round(digits = 0)) %>%
    mutate(prop = (price / sum(price) * 100) %>% round(digits = 0))

# A tibble: 4 × 4
    group total price  prop
    <chr> <dbl> <dbl> <dbl>
1 Group 1  3.31    83    20
2 Group 2  6.01   150    35
3 Group 3  7.18   180    42
4 Group 4  0.47    12     3

Summary 2 - includes the account, but does not calculate the proportion correctly

tidy_data %>% 
    group_by(group, account) %>%
    summarise(total = sum(duration) %>% time_length(unit = "hour") %>% round(digits = 2),
                        price = (total*hourPrice) %>% round(digits = 0)) %>%
    mutate(prop = (price / sum(price) * 100) %>% round(digits = 0))

#Source: local data frame [4 x 5]
#Groups: group [4]

    group   account total price  prop
    <chr>     <chr> <dbl> <dbl> <dbl>
1 Group 1 Account 1  3.31    83   100
2 Group 2 Account 2  6.01   150   100
3 Group 3 Account 3  7.18   180   100
4 Group 4 Account 4  0.47    12   100

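I realize the problem is that, because of the two grouping variables, the sum in the second case is only taken within each group. I considered doing Summary 1 and then joining the account onto the table, but it seems to me there must be a better solution.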
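To make the grouping behaviour described above concrete, here is a minimal sketch on hypothetical toy data (the `toy` tibble and its values are made up for illustration and are not the question's `tidy_data`). It assumes dplyr's default behaviour, where `summarise()` after `group_by(group, account)` drops only the last grouping variable, so the result is still grouped by `group`.

```r
library(dplyr)

# Hypothetical toy data, invented for this illustration
toy <- tibble(
  group   = c("G1", "G1", "G2"),
  account = c("A1", "A1", "A2"),
  price   = c(10, 20, 70)
)

summed <- toy %>%
  group_by(group, account) %>%
  summarise(price = sum(price))

# summarise() peeled off `account` only; `group` is still a grouping variable
group_vars(summed)

# A mutate() on the still-grouped result sums price within each group,
# so every single-row group ends up with 100
with_prop <- summed %>%
  mutate(prop = round(price / sum(price) * 100))
with_prop$prop
```

Newer dplyr versions may also print a message about the grouping of the `summarise()` output, which points at the same behaviour.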

Edit: my desired output:

    group   account total price  prop
    <chr>     <chr> <dbl> <dbl> <dbl>
1 Group 1 Account 1  3.31    83    20
2 Group 2 Account 2  6.01   150    35
3 Group 3 Account 3  7.18   180    42
4 Group 4 Account 4  0.47    12     3

1 answer:

Answer 0 (score: 0)

We use mutate instead of summarise to create the new columns in the dataset, then slice the first row of each 'group', calculate 'prop' and remove the 'duration' column.

tidy_data %>% 
      group_by(group) %>%
      mutate(total = sum(duration) %>% 
                time_length(unit = "hour") %>%
                round(digits = 2), 
              price = (total*hourPrice) %>% 
                 round(digits = 0)) %>% 
      slice(1L) %>% 
      ungroup() %>%
      mutate(prop = (price / sum(price) * 100) %>% 
           round(digits = 0)) %>%
      select(-duration)     
# A tibble: 4 × 5
#     group   account total price  prop
#     <chr>     <chr> <dbl> <dbl> <dbl>
# 1 Group 1 Account 1  3.31    83    20
# 2 Group 2 Account 2  6.01   150    35
# 3 Group 3 Account 3  7.18   180    42
# 4 Group 4 Account 4  0.47    12     3
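
A possible alternative, sketched here under the assumption that keeping `summarise()` is preferred: group by both columns as in Summary 2, then call `ungroup()` before the final `mutate()` so that `sum(price)` runs over the whole table. Unlike `slice(1L)`, this does not rely on each group having exactly one account. The data below is the question's `tidy_data` rebuilt compactly (with `paste()` and `as.difftime()`) only so that the block runs on its own; `result` is a name introduced for illustration.

```r
library(dplyr)
library(lubridate)

# Same values as the question's tidy_data, rebuilt so the block is self-contained
tidy_data <- tibble(
  group    = paste("Group",   c(1, 2, 3, 1, 2, 3, 4, 4, 2)),
  account  = paste("Account", c(1, 2, 3, 1, 2, 3, 4, 4, 2)),
  duration = as.difftime(
    c(146.15, 181.416666666667, 96.9, 52.2833333333333, 99.4333333333333,
      334.116666666667, 16.6333333333333, 11.5666666666667, 79.5666666666667),
    units = "mins"
  )
)
hourPrice <- 25

result <- tidy_data %>%
  group_by(group, account) %>%
  summarise(
    total = sum(duration) %>% time_length(unit = "hour") %>% round(digits = 2),
    price = (total * hourPrice) %>% round(digits = 0)
  ) %>%
  ungroup() %>%  # drop the leftover grouping by `group`
  mutate(prop = (price / sum(price) * 100) %>% round(digits = 0))

result
```

With the sample data this should give the same group, account, total, price and prop columns as the desired output in the question.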