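I have the following data frame and want to calculate percentages by stage and category. Some of my other data has an additional variable, such as year. I need the output as a data frame so that I can use it with ggplot2.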
Gender = rep(c("Female", "Male"), 6)
Stage = rep(c("Applied", "Appointed", "Interviewed"), each=2, times = 2)
Category = rep(c("Professional", "Research"), each = 6)
Count = as.integer(c("346", "251", "22", "15", "60", "52", "31", "230", "4", "17", "9", "52"))
df = data.frame(Gender, Stage, Category,Count )
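The (horrible) code I wrote works in some cases, but if the structure of the data changes, for example when there is no row for females because their count is 0, the code breaks.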
totals = aggregate(df$Count, by = list(Stage = df$Stage, Category = df$Category),sum)
totals = rep( totals$x, each = 2)
df$Percentage = round(df$Count/totals, 2)
This is the output I am after:
Gender Stage Category Count Percentage
1 Female Applied Professional 346 0.58
2 Male Applied Professional 251 0.42
3 Female Appointed Professional 22 0.59
4 Male Appointed Professional 15 0.41
5 Female Interviewed Professional 60 0.54
6 Male Interviewed Professional 52 0.46
7 Female Applied Research 31 0.12
8 Male Applied Research 230 0.88
9 Female Appointed Research 4 0.19
10 Male Appointed Research 17 0.81
11 Female Interviewed Research 9 0.15
12 Male Interviewed Research 52 0.85
Thanks for your help!
Answer 0 (score: 3)
We can use the ave function:
df$Percentage <- df$Count / ave(df$Count, df$Stage, df$Category, FUN = sum)
Gender Stage Category Count Percentage
1 Female Applied Professional 346 0.5795645
2 Male Applied Professional 251 0.4204355
3 Female Appointed Professional 22 0.5945946
4 Male Appointed Professional 15 0.4054054
5 Female Interviewed Professional 60 0.5357143
6 Male Interviewed Professional 52 0.4642857
7 Female Applied Research 31 0.1187739
8 Male Applied Research 230 0.8812261
9 Female Appointed Research 4 0.1904762
10 Male Appointed Research 17 0.8095238
11 Female Interviewed Research 9 0.1475410
12 Male Interviewed Research 52 0.8524590
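The question also mentions data with an extra variable such as year. The following is a minimal sketch of how the same ave call might extend to that case. It assumes a hypothetical Year column and invented counts, neither of which appears in the original data. Because ave sums within each group instead of assuming two rows per group, it also handles a group that has only one gender row.

```r
# Hypothetical data: the original df has no Year column,
# and the counts below are invented for illustration only.
df2 <- data.frame(
  Gender = c("Female", "Male", "Female", "Male", "Male"),
  Stage  = c("Applied", "Applied", "Applied", "Applied", "Appointed"),
  Year   = c(2015, 2015, 2016, 2016, 2016),
  Count  = c(30L, 70L, 10L, 40L, 5L)
)

# ave() returns a vector as long as Count, holding each row's group total
group_total <- ave(df2$Count, df2$Stage, df2$Year, FUN = sum)

# Share of each row within its Stage/Year group, rounded as in the question
df2$Percentage <- round(df2$Count / group_total, 2)
print(df2)
```

The last row is the only one in its group, so its share is 1 rather than an error or a misaligned total.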
Answer 1 (score: 2)
We can use dplyr:
library(dplyr)
df %>%
  group_by(Stage, Category) %>%
  mutate(Percentage = round(Count / sum(Count), 2))
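To check the case the question worries about, here is a small sketch that drops one row from the example data and reruns the grouped calculation. Removing row 9 is an assumption made purely for illustration, and the trailing ungroup() is an optional addition that is not part of the answer above.

```r
library(dplyr)

# Rebuild the data frame from the question
df <- data.frame(
  Gender   = rep(c("Female", "Male"), 6),
  Stage    = rep(c("Applied", "Appointed", "Interviewed"), each = 2, times = 2),
  Category = rep(c("Professional", "Research"), each = 6),
  Count    = c(346L, 251L, 22L, 15L, 60L, 52L, 31L, 230L, 4L, 17L, 9L, 52L)
)

# Assumption for illustration: the Female/Appointed/Research row is absent
df_missing <- df[-9, ]

result <- df_missing %>%
  group_by(Stage, Category) %>%
  mutate(Percentage = round(Count / sum(Count), 2)) %>%
  ungroup()  # remove the grouping so later steps see an ungrouped tibble

print(result)
```

Since sum(Count) is evaluated per group, the remaining Male row in the Appointed/Research group gets a share of 1 and the other groups are unaffected.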
Answer 2 (score: 1)
I would suggest using the data.table package. With it you can write something like:
library(data.table)
dt <- as.data.table(df)  # convert the question's data frame to a data.table
dt[, Percentage := round(Count / sum(Count), 2), by = c("Stage", "Category")]
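The reason I suggest data.table is that it is one of the fastest packages for working with data frames. In general, standard data frames perform poorly by comparison.
Compared with dplyr, data.table is faster, but it has no transparent interface to SQL databases.
data.table gets its speed mainly by avoiding copies during data transformations: operations such as := modify the table by reference.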
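The point about avoiding copies can be seen in a small sketch. It uses only the four Professional Applied and Appointed rows from the question; the snapshot variable is a hypothetical name introduced here to show that an explicit copy() is needed to keep the original table.

```r
library(data.table)

# Subset of the question's data (Professional category only)
dt <- data.table(
  Gender = c("Female", "Male", "Female", "Male"),
  Stage  = c("Applied", "Applied", "Appointed", "Appointed"),
  Count  = c(346L, 251L, 22L, 15L)
)

# copy() creates an independent table; plain assignment (x <- dt) would not
snapshot <- copy(dt)

# := adds the column to dt in place instead of building a new table
dt[, Percentage := round(Count / sum(Count), 2), by = Stage]

print(names(dt))        # includes "Percentage"
print(names(snapshot))  # still only the original three columns
```

Note that dt is changed without reassigning the result, which is the opposite of the usual data.frame idiom.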
See the data.table manual for details.