So I have the following dataset:
Employee EducLev JobGrade YrsExper Age Gender YrsPrior PCJob Salary
1 3 1 3 26 Male 1 No 32000
2 1 1 14 38 Female 1 No 39100
3 1 1 12 35 Female 0 No 33200
4 2 1 8 40 Female 7 No 30600
5 3 1 3 28 Male 0 No 29000
6 3 2 3 24 Female 0 No 30500
7 3 2 4 27 Female 0 No 30000
8 3 2 8 33 Male 2 No 27000
9 1 3 4 62 Female 0 No 34000
10 3 3 9 31 Female 0 No 29500
11 3 4 9 34 Female 2 No 26800
12 2 5 8 37 Female 8 No 31300
13 2 5 9 37 Female 0 No 31200
14 2 6 10 58 Female 6 No 34700
15 3 6 4 33 Female 0 No 30000
16 3 6 3 27 Female 0 No 31000
I need output like this:
JobGrade Female Male Total
1 34.29% 17.65% 28.85%
2 20.71% 19.12% 20.19%
3 25.71% 10.29% 20.67%
4 12.14% 16.18% 13.46%
5 6.43% 17.65% 10.10%
6 0.71% 19.12% 6.73%
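I looked at some other posts that use the aggregate function, but I could not get it to work for this case. Can anyone help me get output like this? P.S.: I don't want to do this by computing all the percentages by hand and then building a new dataset from them.
I did manage to solve the problem myself with the following code, but I don't think it is the right way to approach it.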
df = data.frame(jobgrade=numeric(), gmale=numeric(), gfemale=numeric(), total=numeric())
for (i in 1:6) {
  df[i, ] = c(i,
              nrow(bsal[bsal$Gender == "Male" & bsal$JobGrade == i, ]) * 100 / nrow(bsal[bsal$JobGrade == i, ]),
              nrow(bsal[bsal$Gender == "Female" & bsal$JobGrade == i, ]) * 100 / nrow(bsal[bsal$JobGrade == i, ]),
              nrow(bsal[bsal$JobGrade == i, ]) * 100 / nrow(bsal))
}
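Answer 0: (score: 4)
You can do this with aggregate. Let's say your data.frame is named df. This approach first creates a filler column of ones, which I named dumm. You can skip this step and add the total afterwards instead.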
df$dumm <- 1
# Sum the indicator columns within each job grade (the argument is FUN, not fun)
results <- aggregate(cbind("Female"=df$Gender == "Female",
                           "Male"=df$Gender == "Male",
                           "total"=df$dumm),
                     by=list(df$JobGrade), FUN=sum)
The resulting data.frame contains the number of males, females, and the total for each job grade. Now simply divide by the overall total:
# Divide only the count columns; the first column holds the job grade
results[-1] <- results[-1] / sum(results$total)
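Note that every column of the target table in the question sums to 100%, which suggests each column is divided by its own total rather than by the grand total. Below is a minimal sketch of that variant. It uses a small made-up data frame (`toy`, which is hypothetical and not the asker's `bsal` data) so that it runs on its own, and the names `counts` and `pct` are likewise only illustrative.

```r
# Toy data for illustration only (hypothetical, not the asker's data)
toy <- data.frame(
  JobGrade = c(1, 1, 1, 2, 2, 3),
  Gender   = c("Female", "Female", "Male", "Female", "Male", "Male")
)

# Counts of females, males and all employees per job grade
counts <- aggregate(
  cbind(Female = toy$Gender == "Female",
        Male   = toy$Gender == "Male",
        Total  = 1),
  by  = list(JobGrade = toy$JobGrade),
  FUN = sum
)

# Divide each count column by its own column sum, leaving JobGrade untouched
pct <- counts
pct[-1] <- lapply(counts[-1], function(x) 100 * x / sum(x))
print(pct)
```

With this variant the Female, Male and Total columns each add up to 100, as in the table the question asks for.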
A second, very common approach is to use the data.table package:
library(data.table)
setDT(df)
results <- df[, list("Female"=sum(Gender == "Female"),
"Male"=sum(Gender == "Male"),
"total"=length(Gender)),
by=.(JobGrade)]
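# Operate on the per-grade counts in results (not df) and keep JobGrade
results <- results[, c(list(JobGrade = JobGrade),
                       lapply(.SD, function(i) i / sum(total))), .SDcols=2:4]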
Answer 1: (score: 0)
Here is another option using dcast from data.table. We convert the 'data.frame' to a 'data.table' (setDT(df1)), reshape it to 'wide' format with fun.aggregate specified as length, join it on 'JobGrade' with the dataset summarised by 'JobGrade', and assign (:=) to columns 2:4 the output obtained by dividing by the sum of 'Total'.
library(data.table)
dcast(setDT(df1), JobGrade~Gender, value.var= "Gender", length)[df1[
, .(Total=.N) ,.(JobGrade)], on = "JobGrade"][, (2:4) := lapply(.SD, `/`,
sum(Total)), .SDcols = 2:4][]
This can also be done with a compact option in base R.
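The base R code from this answer is not preserved here. As a rough illustration of what a compact base R approach could look like (an assumption, not the answerer's code), the sketch below uses table and prop.table on a small hypothetical data frame `toy` and produces column percentages formatted like the table in the question.

```r
# Toy data for illustration only (hypothetical, not the asker's data)
toy <- data.frame(
  JobGrade = c(1, 1, 1, 2, 2, 3),
  Gender   = c("Female", "Female", "Male", "Female", "Male", "Male")
)

# JobGrade x Gender counts as a plain matrix, plus a Total column
counts <- unclass(table(toy$JobGrade, toy$Gender))
counts <- cbind(counts, Total = rowSums(counts))

# Column percentages: each column is divided by its own sum
pct <- 100 * prop.table(counts, margin = 2)

# Format as strings such as "66.67%" for display
formatted <- data.frame(
  JobGrade = rownames(pct),
  apply(pct, 2, function(x) sprintf("%.2f%%", x))
)
print(formatted)
```

Here margin = 2 is what makes each column sum to 100%. Using margin = 1 instead would give the gender split within each job grade.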