Calculating percentages within groups of a data frame in R

Time: 2017-11-17 16:03:11

Tags: r percentage

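I have the following data frame and want to calculate percentages within each combination of Stage and Category. Some of my other data sets have an additional variable, e.g. year. I need the output as a data frame so that I can use it with ggplot2.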

Gender = rep(c("Female", "Male"), 6)
Stage = rep(c("Applied", "Appointed", "Interviewed"), each=2, times = 2)
Category = rep(c("Professional", "Research"), each = 6)
Count = as.integer(c("346", "251", "22", "15", "60", "52", "31", "230", "4", "17", "9", "52"))
df = data.frame(Gender, Stage, Category,Count )

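The (horrible) code I wrote works in some cases, but if the structure of the data changes, e.g. a group has no row for females because their count is 0, the code breaks.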

totals = aggregate(df$Count, by = list(Stage = df$Stage, Category = df$Category),sum)
totals = rep( totals$x, each = 2)
df$Percentage = round(df$Count/totals, 2)

This is the output I am after:

   Gender       Stage     Category Count Percentage
1  Female     Applied Professional   346       0.58
2    Male     Applied Professional   251       0.42
3  Female   Appointed Professional    22       0.59
4    Male   Appointed Professional    15       0.41
5  Female Interviewed Professional    60       0.54
6    Male Interviewed Professional    52       0.46
7  Female     Applied     Research    31       0.12
8    Male     Applied     Research   230       0.88
9  Female   Appointed     Research     4       0.19
10   Male   Appointed     Research    17       0.81
11 Female Interviewed     Research     9       0.15
12   Male Interviewed     Research    52       0.85

Thanks for your help!

3 Answers:

Answer 0: (score: 3)

We can use the ave function:

df$Percentage <- df$Count / ave(df$Count, df$Stage, df$Category, FUN = sum)

   Gender       Stage     Category Count Percentage
1  Female     Applied Professional   346  0.5795645
2    Male     Applied Professional   251  0.4204355
3  Female   Appointed Professional    22  0.5945946
4    Male   Appointed Professional    15  0.4054054
5  Female Interviewed Professional    60  0.5357143
6    Male Interviewed Professional    52  0.4642857
7  Female     Applied     Research    31  0.1187739
8    Male     Applied     Research   230  0.8812261
9  Female   Appointed     Research     4  0.1904762
10   Male   Appointed     Research    17  0.8095238
11 Female Interviewed     Research     9  0.1475410
12   Male Interviewed     Research    52  0.8524590
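
As a minimal sketch of why this approach is more robust than repeating the totals with rep(..., each = 2): ave() returns one group total per row, so it does not assume that every group has exactly two rows. The example below rebuilds the question's data and drops one row to mimic a group with no female entry. The dropped row and the names df_missing and group_total are assumptions made for illustration only.

```r
# Rebuild the data from the question.
Gender   <- rep(c("Female", "Male"), 6)
Stage    <- rep(c("Applied", "Appointed", "Interviewed"), each = 2, times = 2)
Category <- rep(c("Professional", "Research"), each = 6)
Count    <- c(346L, 251L, 22L, 15L, 60L, 52L, 31L, 230L, 4L, 17L, 9L, 52L)
df <- data.frame(Gender, Stage, Category, Count)

# Hypothetical scenario: the Female / Appointed / Research row (row 9) is absent.
df_missing <- df[-9, ]

# ave() computes the sum within each Stage/Category group and returns it
# aligned with the rows, whatever the group sizes are.
group_total <- ave(df_missing$Count, df_missing$Stage, df_missing$Category, FUN = sum)

# Round to two decimals to match the format requested in the question.
df_missing$Percentage <- round(df_missing$Count / group_total, 2)

print(df_missing)
```

In this sketch the remaining Male / Appointed / Research row is the only member of its group, so its share is 1, while all other groups are unaffected.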

Answer 1: (score: 2)

We can use dplyr:

library(dplyr)
df %>% 
   group_by(Stage, Category) %>%
   mutate(Percentage = round(Count/sum(Count), 2))
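
The question also mentions data with an extra variable such as year. A possible way to handle that, sketched here with made-up counts, is to add the extra column to group_by(). The data frame df_year and its values are hypothetical. The trailing ungroup() and as.data.frame() calls are optional and only return a plain data frame, which may be convenient before passing the result to ggplot2.

```r
library(dplyr)

# Hypothetical data: same layout as the question plus a Year column.
# The counts are invented for illustration.
df_year <- data.frame(
  Year     = rep(c(2016, 2017), each = 4),
  Gender   = rep(c("Female", "Male"), 4),
  Stage    = rep(c("Applied", "Appointed"), each = 2, times = 2),
  Category = "Research",
  Count    = c(30L, 70L, 5L, 15L, 40L, 60L, 10L, 10L)
)

result <- df_year %>%
  group_by(Year, Stage, Category) %>%                      # one group per Year/Stage/Category
  mutate(Percentage = round(Count / sum(Count), 2)) %>%    # share of each row within its group
  ungroup() %>%                                            # drop the grouping again
  as.data.frame()                                          # return a plain data frame

print(result)
```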

Answer 2: (score: 1)

I suggest using the data.table package. There you can write something like:

library(data.table)
dt <- as.data.table(df)  # convert the question's data frame to a data.table
dt[, Percentage := round(Count / sum(Count), 2), by = c("Stage", "Category")]

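The reason I suggest data.table is that it is one of the fastest packages for working with data frames. In general, standard data frames perform rather poorly.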

Compared with dplyr, data.table is faster, but it has no transparent interface to SQL databases.

data.table gets its speed mainly by avoiding copies during data transformations: operations such as := modify the table in place.
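
A small sketch of what modifying in place means in practice, using a made-up four-row table: after a plain assignment both names refer to the same table, so a column added with := is visible through either name, whereas copy() produces an independent table. The names alias and snapshot are chosen for illustration only.

```r
library(data.table)

# Hypothetical miniature table; the counts are invented.
dt <- data.table(
  Stage = c("Applied", "Applied", "Appointed", "Appointed"),
  Count = c(30L, 70L, 5L, 15L)
)

alias    <- dt        # plain assignment: both names refer to the same table
snapshot <- copy(dt)  # copy() creates an independent copy

# := adds the column by reference, without copying the table.
dt[, Percentage := round(Count / sum(Count), 2), by = Stage]

alias_has_column    <- "Percentage" %in% names(alias)     # the alias sees the new column
snapshot_has_column <- "Percentage" %in% names(snapshot)  # the explicit copy does not

print(alias_has_column)
print(snapshot_has_column)
```

This is also why copy() is worth using whenever the original table must stay unchanged.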

See the data.table manual for details.