Calculating a ratio and ignoring NA

Asked: 2018-04-11 11:48:26

Tags: r dplyr

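I have a dataset similar to the one below. My end goal is a table that shows the average salary for each gender, plus ratio variables such as women's average salary divided by men's.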

library(dplyr)
x <- data.frame(Department = c("Dep1", "Dep1","Dep2", "Dep2","Dep3"),
            Gender = c("F", "M",  "F", "M", "F"),
            Salary = seq(10,14))

      Department Gender Salary
1       Dep1      F     10
2       Dep1      M     11
3       Dep2      F     12
4       Dep2      M     13
5       Dep3      F     14

Step 1: First, I use summarise to compute the summary statistics I need.

Table <- x %>%
  group_by(Department, Gender) %>%
  summarise(Count = n(),
            AverageSalary = mean(Salary, na.rm = TRUE),
            MedianSalary = median(Salary, na.rm = TRUE))

Step 2: To compute the ratios and add them as new columns to `Table`, I use a tip I got from this forum a few days ago.

Table %>% group_by(Department) %>% 
mutate(`AvgSalaryWomen/Men` = AverageSalary[Gender == "F"]/AverageSalary[Gender == "M"],
     `MedianSalaryWomen/Men` = MedianSalary[Gender == "F"]/MedianSalary[Gender == "M"])

My problem is that Dep3 has no men, so I get the following error message:

Error in mutate_impl(.data, dots) : 
Column `AvgSalaryWomen/Men` must be length 1 (the group size), not 0
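
The error appears to come from how logical subsetting behaves when nothing matches: the result is a zero-length vector, and dividing by it is also zero-length, which mutate cannot recycle to the group size. A minimal base R sketch, where `avg` and `gender` are hypothetical stand-ins for the Dep3 group:

```r
# Hypothetical stand-ins for the single-row Dep3 group.
avg <- c(14)
gender <- c("F")

# No element of gender equals "M", so the subset is empty.
men <- avg[gender == "M"]
length(men)    # zero-length numeric vector

# Dividing by a zero-length vector gives a zero-length result.
ratio <- avg[gender == "F"] / men
length(ratio)
```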

What I would like is something like this:

  Department Gender Count AverageSalary MedianSalary AvgSalaryWomen.Men MedianSalaryWomen.Men
1       Dep1      F     1            10           10          0.9090909             0.9090909
2       Dep1      M     1            11           11          0.9090909             0.9090909
3       Dep2      F     1            12           12          0.9230769             0.9230769
4       Dep2      M     1            13           13          0.9230769             0.9230769
5       Dep3      F     1            14           14                 NA                    NA

Or:

  Department Gender Count AverageSalary MedianSalary AvgSalaryWomen.Men MedianSalaryWomen.Men
1       Dep1      F     1            10           10          0.9090909             0.9090909
2       Dep1      M     1            11           11                 NA                    NA
3       Dep2      F     1            12           12          0.9230769             0.9230769
4       Dep2      M     1            13           13                 NA                    NA
5       Dep3      F     1            14           14                 NA                    NA

Is there an easy way to get either of these two results? I would guess that alternative 1 is the simplest. Thanks in advance!

1 answer:

Answer 0 (score: 1)

With ifelse, you can check whether both genders are present in a department before computing the ratio, and return NA if they are not. Like this:

Table %>% group_by(Department) %>% 
  mutate(`AvgSalaryWomen/Men` = ifelse(length(unique(Gender)) == 2,
         AverageSalary[Gender == "F"]/AverageSalary[Gender == "M"], NA),
         `MedianSalaryWomen/Men` = ifelse(length(unique(Gender)) == 2, 
          MedianSalary[Gender == "F"]/MedianSalary[Gender == "M"], NA))
# A tibble: 5 x 7
# Groups:   Department [3]
  Department Gender Count AverageSalary MedianSalary `AvgSalaryWomen/Men` `MedianSalaryWomen/Men`
  <fct>      <fct>  <int>         <dbl>        <int>                <dbl>                   <dbl>
1 Dep1       F          1          10.0           10                0.909                   0.909
2 Dep1       M          1          11.0           11                0.909                   0.909
3 Dep2       F          1          12.0           12                0.923                   0.923
4 Dep2       M          1          13.0           13                0.923                   0.923
5 Dep3       F          1          14.0           14               NA                      NA
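
As a possible alternative to the ifelse check (a sketch, not the method from the answer above), the subset can be taken with match(), which returns NA when a gender is absent. Indexing with NA yields NA, so the ratio becomes NA without an explicit test. The names `result` and `result_f_only` are illustrative. The second step shows one way to get the other requested layout, keeping the ratio only on the F rows.

```r
library(dplyr)

# Same toy data as in the question.
x <- data.frame(Department = c("Dep1", "Dep1", "Dep2", "Dep2", "Dep3"),
                Gender = c("F", "M", "F", "M", "F"),
                Salary = seq(10, 14))

Table <- x %>%
  group_by(Department, Gender) %>%
  summarise(Count = n(),
            AverageSalary = mean(Salary, na.rm = TRUE),
            MedianSalary = median(Salary, na.rm = TRUE)) %>%
  ungroup()

# Alternative 1: match() gives NA for a missing gender, and x[NA] is NA,
# so a department with only one gender gets NA on all of its rows.
result <- Table %>%
  group_by(Department) %>%
  mutate(`AvgSalaryWomen/Men` = AverageSalary[match("F", Gender)] /
                                AverageSalary[match("M", Gender)],
         `MedianSalaryWomen/Men` = MedianSalary[match("F", Gender)] /
                                   MedianSalary[match("M", Gender)]) %>%
  ungroup()

# Alternative 2: keep the ratio on the F rows only and blank out the M rows.
result_f_only <- result %>%
  mutate(`AvgSalaryWomen/Men` = if_else(Gender == "F",
                                        `AvgSalaryWomen/Men`, NA_real_),
         `MedianSalaryWomen/Men` = if_else(Gender == "F",
                                           `MedianSalaryWomen/Men`, NA_real_))
```

This assumes each department has at most one row per gender in `Table`, which holds after the summarise step because it groups by Department and Gender.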