I often use the functions group_by() and summarize() from the dplyr package in R (note: if the summary statistic is a count(), sum() works the same way).
Here is an example:
library(dplyr)
data <- data.frame(
  group = sample(rep(c("Group A", "Group B", "Group C", "Group D"), 4), 16, replace = FALSE),
  factor = sample(rep(c("Factor 1", "Factor 2"), 8), 16, replace = FALSE),
  var1 = sample(1:16)
)
Summarizing it gives this output:
out_df <-
  data %>%
  # Group by both columns so that factor is kept in the summary, as in the output shown
  group_by(group, factor) %>%
  summarize(sum_var1 = sum(var1))
print(out_df)
Source: local data frame [7 x 3]
Groups: group [4]

    group   factor sum_var1
   <fctr>   <fctr>    <int>
1 Group A Factor 1       29
2 Group B Factor 1        8
3 Group C Factor 1       33
4 Group D Factor 1       12
5 Group A Factor 2       27
6 Group B Factor 2       10
7 Group C Factor 2       17
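Now, I often want to find the proportion of each sum_var1 value, not as a proportion of the overall total, but as a share of the total for one level of a factor, such as the factor variable.
I usually do this by finding the sum for each level of the factor and then manually dividing the observations by it, like this: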
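# Total of sum_var1 for each level of factor
out_df %>% group_by(factor) %>% summarize(factor_sum = sum(sum_var1))
# Hard-code those totals, repeated once per row of out_df
to_divide <- c(rep(82, 4), rep(54, 4))
out_df$factor_prop_sum_var1 <- out_df$sum_var1 / to_divide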
This produces the desired output:

Source: local data frame [8 x 4]
Groups: group [4]

    group   factor sum_var1 factor_prop_sum_var1
   <fctr>   <fctr>    <int>                <dbl>
1 Group A Factor 1       26            0.3170732
2 Group B Factor 1       17            0.2073171
3 Group C Factor 1       19            0.2317073
4 Group D Factor 1       18            0.2195122
5 Group A Factor 2        8            0.1481481
6 Group B Factor 2       19            0.3518519
7 Group C Factor 2        7            0.1296296
8 Group D Factor 2       22            0.4074074

I can also check that the sum of factor_prop_sum_var1 equals 1 for each level of factor:

out_df %>% group_by(factor) %>% summarize(checking = sum(factor_prop_sum_var1))
# A tibble: 2 × 2
    factor checking
    <fctr>    <dbl>
1 Factor 1        1
2 Factor 2        1

This works, but it is clunky at best. Is there a better, more elegant way to do this (ideally within a "pipe")?
Answer 0 (score: 4)
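To get proportions within a group, just group by the column within which you want the proportions to add up to 100%. So in this case, after getting the sum for each combination of group and factor, use group_by again, but this time group only by factor, and then calculate the percentage.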
library(dplyr)
set.seed(100)
data <- data.frame(
  group = sample(rep(c("Group A", "Group B", "Group C", "Group D"), 4), 16, replace = FALSE),
  factor = sample(rep(c("Factor 1", "Factor 2"), 8), 16, replace = FALSE),
  var1 = sample(1:16)
)
data %>%
  # Sum var1 for each combination of group and factor
  group_by(group, factor) %>%
  summarize(sum_var1 = sum(var1)) %>%
  # Regroup by factor only, so the proportions add up to 1 within each factor
  group_by(factor) %>%
  mutate(percent = sum_var1/sum(sum_var1)) %>%
  arrange(factor)
    group   factor sum_var1    percent
1 Group A Factor 1       13 0.25000000
2 Group B Factor 1        8 0.15384615
3 Group C Factor 1       21 0.40384615
4 Group D Factor 1       10 0.19230769
5 Group A Factor 2       20 0.23809524
6 Group B Factor 2       27 0.32142857
7 Group C Factor 2        2 0.02380952
8 Group D Factor 2       35 0.41666667
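
To make the regrouping step easier to verify by hand, here is a minimal sketch of the same pattern on a tiny made-up data set. The data frame toy and the names props and check are hypothetical, chosen only so the sums are easy to follow. It also assumes dplyr 1.0.0 or later, where summarize() accepts a .groups argument; on older versions you could drop that argument, since the following group_by(factor) replaces the grouping anyway.

```r
library(dplyr)

# Hypothetical toy data: two groups crossed with two factor levels
toy <- data.frame(
  group  = c("Group A", "Group A", "Group B", "Group B"),
  factor = c("Factor 1", "Factor 2", "Factor 1", "Factor 2"),
  var1   = c(1, 2, 3, 6)
)

props <- toy %>%
  # One row per group/factor combination
  group_by(group, factor) %>%
  summarize(sum_var1 = sum(var1), .groups = "drop") %>%
  # Regroup by factor only: sum(sum_var1) is now the per-factor total
  group_by(factor) %>%
  mutate(percent = sum_var1 / sum(sum_var1)) %>%
  ungroup()

# Same check as in the question: proportions should sum to 1 per factor
check <- props %>%
  group_by(factor) %>%
  summarize(checking = sum(percent))

print(props)
print(check)
```

With these values, Factor 1 totals 4 (1 + 3) and Factor 2 totals 8 (2 + 6), so the proportions are 0.25 and 0.75 within each factor. The final ungroup() is optional; it just avoids carrying the factor grouping into later steps by accident.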