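Finding the proportion of grouped observations using dplyr in R

Time: 2016-11-15 22:52:56

Tags: r dplyr

I often use the group_by() and summarize() functions from the dplyr package in R, typically with a count or a sum as the summary statistic.

Here is an example: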

library(dplyr)

data <- data.frame(
  group = sample(rep(c("Group A", "Group B", "Group C", "Group D"), 4), 16, replace = F),
  factor = sample(rep(c("Factor 1", "Factor 2"), 8), 16, replace = F),
  var1 = sample(1:16)
)

Here is the summary and its output:

out_df <- 
    data %>% 
        group_by(group, factor) %>% 
        summarize(sum_var1 = sum(var1))

print(out_df)

Source: local data frame [7 x 3]
Groups: group [4]

    group   factor sum_var1
   <fctr>   <fctr>    <int>
1 Group A Factor 1       29
2 Group B Factor 1        8
3 Group C Factor 1       33
4 Group D Factor 1       12
5 Group A Factor 2       27
6 Group B Factor 2       10
7 Group C Factor 2       17

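Now, I often want to find the proportion that each sum_var1 value represents, not of the overall total, but of the total for one level of a factor, such as the factor variable.

I usually do this by finding the sum for each level of the factor and then manually dividing the observations by it, like this: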

out_df %>% group_by(factor) %>% summarize(factor_sum = sum(sum_var1))
to_divide <- (c(rep(82, 4), rep(54, 4)))
out_df$factor_prop_sum_var1 <- out_df$sum_var1 / to_divide

This produces the desired output:

out_df

Source: local data frame [8 x 4]
Groups: group [4]

    group   factor sum_var1 factor_prop_sum_var1
   <fctr>   <fctr>    <int>                <dbl>
1 Group A Factor 1       26            0.3170732
2 Group B Factor 1       17            0.2073171
3 Group C Factor 1       19            0.2317073
4 Group D Factor 1       18            0.2195122
5 Group A Factor 2        8            0.1481481
6 Group B Factor 2       19            0.3518519
7 Group C Factor 2        7            0.1296296
8 Group D Factor 2       22            0.4074074

I can also check that factor_prop_sum_var1 sums to 1 within each level of factor:

out_df %>% group_by(factor) %>% summarize(checking = sum(factor_prop_sum_var1))

# A tibble: 2 × 2
    factor checking
    <fctr>    <dbl>
1 Factor 1        1
2 Factor 2        1

This works, but it is clunky at best. Is there a better, er, more elegant way to do this (preferably within a "pipe")?

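1 Answer:

Answer 0: (score: 4)

To get proportions within groups, simply group by the column within which you want the proportions to add up to 100%. So in this case, after getting the sum for each combination of group and factor, use group_by again, but this time group only by factor, and then calculate the percentage.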

library(dplyr)

set.seed(100)
data <- data.frame(
  group = sample(rep(c("Group A", "Group B", "Group C", "Group D"), 4), 16, replace = F),
  factor = sample(rep(c("Factor 1", "Factor 2"), 8), 16, replace = F),
  var1 = sample(1:16)
)

data %>% 
  group_by(group, factor) %>% 
  summarize(sum_var1 = sum(var1)) %>%
  group_by(factor) %>%
  mutate(percent = sum_var1/sum(sum_var1)) %>%
  arrange(factor)
    group   factor sum_var1    percent
1 Group A Factor 1       13 0.25000000
2 Group B Factor 1        8 0.15384615
3 Group C Factor 1       21 0.40384615
4 Group D Factor 1       10 0.19230769
5 Group A Factor 2       20 0.23809524
6 Group B Factor 2       27 0.32142857
7 Group C Factor 2        2 0.02380952
8 Group D Factor 2       35 0.41666667
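
The same regrouping idea can be checked on a small deterministic input. The following is only a sketch: the toy data frame and the names props and check are made up for illustration and are not part of the original question or answer. It assumes the sums per group and factor have already been computed, regroups by factor, and then confirms that the proportions add up to 1 within each level.

```r
library(dplyr)

# Hypothetical hand-made sums (not the random data above), so the result is reproducible.
toy <- data.frame(
  group    = c("Group A", "Group B", "Group A", "Group B"),
  factor   = c("Factor 1", "Factor 1", "Factor 2", "Factor 2"),
  sum_var1 = c(10, 30, 5, 15)
)

# Group by the column within which the proportions should add up to 1,
# then divide each value by its group total.
props <- toy %>%
  group_by(factor) %>%
  mutate(percent = sum_var1 / sum(sum_var1)) %>%
  ungroup()  # drop the grouping so later steps are not silently grouped

# Sanity check: the proportions should sum to 1 within each level of factor.
check <- props %>%
  group_by(factor) %>%
  summarize(checking = sum(percent))

print(props)
print(check)
```

Because mutate() keeps every row while sum() is evaluated per group, no hard-coded divisor vector such as to_divide is needed, and the check mirrors the one from the question.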