Need help with high-level grouping and data manipulation in a data frame

Asked: 2016-06-05 16:11:37

Tags: r dataframe

I have the following dataset:

Employee    EducLev JobGrade    YrsExper    Age Gender  YrsPrior    PCJob   Salary
1   3   1   3   26  Male    1   No  32000
2   1   1   14  38  Female  1   No  39100
3   1   1   12  35  Female  0   No  33200
4   2   1   8   40  Female  7   No  30600
5   3   1   3   28  Male    0   No  29000
6   3   2   3   24  Female  0   No  30500
7   3   2   4   27  Female  0   No  30000
8   3   2   8   33  Male    2   No  27000
9   1   3   4   62  Female  0   No  34000
10  3   3   9   31  Female  0   No  29500
11  3   4   9   34  Female  2   No  26800
12  2   5   8   37  Female  8   No  31300
13  2   5   9   37  Female  0   No  31200
14  2   6   10  58  Female  6   No  34700
15  3   6   4   33  Female  0   No  30000
16  3   6   3   27  Female  0   No  31000

I need output like this:

JobGrade    Female  Male    Total
1            34.29% 17.65%  28.85%
2            20.71% 19.12%  20.19%
3            25.71% 10.29%  20.67%
4            12.14% 16.18%  13.46%
5            6.43%  17.65%  10.10%
6            0.71%  19.12%  6.73%

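I looked at some other posts that use the aggregate function, but I could not get it to work in this case. Can anyone help me get output like this? P.S.: I don't want to do this by computing all the percentages and then creating a new dataset.

I worked around the problem myself with the following code, but I don't think it is the right way to solve it.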

df = data.frame(jobgrade=numeric(), gmale=numeric(), gfemale=numeric(), total=numeric())

for(i in 1:6)
{
 df[i,]=c(i, nrow(bsal[bsal$Gender=="Male"&bsal$JobGrade==i,]) * 100 / nrow(bsal[bsal$JobGrade==i,]), 
          nrow(bsal[bsal$Gender=="Female"& bsal$JobGrade==i,]) * 100 / nrow(bsal[bsal$JobGrade==i,]),
          nrow(bsal[bsal$JobGrade==i,]) * 100/nrow(bsal))
}

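2 answers:

Answer 0 (score: 4)

You can do this with aggregate. Let's say your data.frame is named df. This approach first creates a helper column filled with 1, which I named dumm. You can skip this step and compute the totals afterwards instead.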

df$dumm <- 1
results <- aggregate(cbind("Female"=df$Gender == "Female", 
                           "Male"=df$Gender == "Male",
                           "total"=df$dumm), 
                    by=list(df$JobGrade), FUN=sum)  # the argument name is FUN (case-sensitive)

The resulting data.frame contains the counts of males, females, and the total per job grade. Now just divide by the sum of the totals:

# divide only the count columns, so the grouping column (job grade) stays intact
results[, 2:4] <- results[, 2:4] / sum(results$total)
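For anyone who wants to try this end to end, here is a minimal sketch that rebuilds the two relevant columns from the 16 rows shown in the question and runs the aggregate step on them. The names bsal, counts, num_cols, share_of_all and share_by_col are just illustrative. The percentages in the question's desired output appear to sum to roughly 100% within each column, so the sketch also shows a per-column variant; which of the two is wanted is an assumption on my part.

```r
# The 16 rows shown in the question (only JobGrade and Gender are needed)
bsal <- data.frame(
  JobGrade = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 6, 6, 6),
  Gender   = c("Male", "Female", "Female", "Female", "Male",
               "Female", "Female", "Male", rep("Female", 8))
)

# Counts per job grade; Total = 1 is recycled to one entry per row
counts <- aggregate(
  cbind(Female = bsal$Gender == "Female",
        Male   = bsal$Gender == "Male",
        Total  = 1),
  by  = list(JobGrade = bsal$JobGrade),
  FUN = sum
)

num_cols <- c("Female", "Male", "Total")

# Variant 1: every count as a share of all employees
share_of_all <- counts
share_of_all[num_cols] <- counts[num_cols] / sum(counts$Total)

# Variant 2: every count as a share of its own column,
# so each of Female, Male and Total sums to 1
share_by_col <- counts
share_by_col[num_cols] <- lapply(counts[num_cols], function(x) x / sum(x))

print(share_by_col)
```

Multiply by 100 if you want percentages rather than proportions.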

A second, very common approach is to use the data.table package:

library(data.table)
setDT(df)

results <- df[, list("Female"=sum(Gender == "Female"), 
                               "Male"=sum(Gender == "Male"),
                               "total"=length(Gender)), 
              by=.(JobGrade)]
# operate on results (the counts), not on df, and update columns 2:4 in place
results[, (2:4) := lapply(.SD, function(i) i / sum(total)), .SDcols=2:4]

Answer 1 (score: 0)

Here is another data.table option, using dcast. We convert the 'data.frame' to a 'data.table' (setDT(df1)), reshape it to 'wide' format with fun.aggregate specified as length, join it on 'JobGrade' with the dataset summarised by 'JobGrade', and assign (:=) to columns 2:4 the output obtained by dividing by the sum of 'Total'.

library(data.table)
dcast(setDT(df1), JobGrade ~ Gender, value.var = "Gender", length)[
  df1[, .(Total = .N), .(JobGrade)], on = "JobGrade"][
  , (2:4) := lapply(.SD, `/`, sum(Total)), .SDcols = 2:4][]

This can also be done with a compact base R option.
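As one possible base R route (a sketch only, not necessarily the code this answer had in mind), table plus prop.table gets close to the layout asked for in the question. It assumes that each column should be expressed as a percentage of its own column total; bsal, counts, pct and formatted are illustrative names, and the data are the 16 rows from the question.

```r
# The 16 rows shown in the question (only JobGrade and Gender are needed)
bsal <- data.frame(
  JobGrade = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 6, 6, 6),
  Gender   = c("Male", "Female", "Female", "Female", "Male",
               "Female", "Female", "Male", rep("Female", 8))
)

# Cross-tabulate: one row per job grade, one column per gender
tab <- table(bsal$JobGrade, bsal$Gender)

# Append a Total column holding the row sums (result is a plain matrix)
counts <- cbind(tab, Total = rowSums(tab))

# margin = 2 divides every column by its own sum
pct <- 100 * prop.table(counts, margin = 2)

# Format as strings such as "31.25%" while keeping the matrix shape
formatted <- pct
formatted[] <- sprintf("%.2f%%", pct)

print(formatted, quote = FALSE)
```

With only these 16 rows the numbers will not match the question's desired output, which was presumably computed on the full dataset.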