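dplyr: create aggregate percentages of factor levels

Date: 2016-08-09 18:17:57

Tags: r dplyr

How can I use dplyr to compute, for each state, the proportion of rows at a given level of a factor variable? For example, I want to add a variable that gives the percentage of rows in the data frame that are female within each state.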

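# generate example data: 40 rows, 20 per state
state <- rep(c(rep("Idaho", 10), rep("Maine", 10)), 2)
# only 8 ids are drawn; data.frame() recycles them 5 times to fill 40 rows
student.id <- sample(1:1000, 8, replace = TRUE)
# pool of 25 "Male" and 75 "Female", from which 40 are drawn without replacement
gender <- rep(c("Male", "Female"), 100 * c(0.25, 0.75))
gender <- sample(gender, 40)
school.data <- data.frame(student.id, state, gender)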

Here is an attempt that I know is wrong, but it does get me to the information:

school.data %>%
  group_by(state, gender %in% c("Female")) %>%
  summarise(count = n()) %>%
  mutate(test_count = count)

I am struggling with the count and mutate functions, which makes it hard to get any further. They do not behave the way I expect.

3 answers:

Answer 0 (score: 8)

To add a new column to the existing data frame:

school.data %>% 
    group_by(state) %>%
    mutate(pct.female = mean(gender == "Female"))

If you want just one row per state rather than a column added to the original data, use summarize instead of mutate:

school.data %>%
   group_by(state) %>%
   summarize(pct.female = mean(gender == "Female"))
# # A tibble: 2 x 2
#    state pct.female
#   <fctr>      <dbl>
# 1  Idaho       0.75
# 2  Maine       0.70
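
Both snippets work because the mean of a logical vector is the proportion of TRUE values. A minimal base R sketch of that idea, using a small hand-made vector (hypothetical values, not the question's random data), with an extra step for readers who want a percentage rather than a proportion:

```r
# Hypothetical toy vector, not the question's data
gender <- c("Female", "Male", "Female", "Female")

is_female   <- gender == "Female"  # logical vector: TRUE FALSE TRUE TRUE
prop_female <- mean(is_female)     # TRUE counts as 1, FALSE as 0, so this is 3/4
pct_female  <- 100 * prop_female   # same value on a 0-100 scale
```

Inside group_by(state), the same mean() call is simply evaluated once per state.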

Answer 1 (score: 7)

Gregor's answer gets to the heart of it. Here is a version that gives counts and proportions for both genders in each state:

library(dplyr)

gender.proportions <- group_by(school.data, state, gender) %>% 
  summarize(n = length(student.id)) %>% # count per gender
  ungroup %>% group_by(state) %>% 
  mutate(proportion = n / sum(n)) # proportion per gender

#   state gender     n proportion
#  <fctr> <fctr> <int>      <dbl>
#1  Idaho Female    16       0.80  
#2  Idaho   Male     4       0.20
#3  Maine Female    11       0.55
#4  Maine   Male     9       0.45
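
Since the question mentions count, here is a sketch of how the same table could be built with dplyr's count(), which does the grouping and tallying in one step. The toy data frame is a hypothetical, deterministic stand-in for school.data so that the result is reproducible:

```r
library(dplyr)

# Hypothetical stand-in for school.data with fixed values
toy <- data.frame(
  state  = rep(c("Idaho", "Maine"), each = 4),
  gender = c("Female", "Female", "Female", "Male",
             "Female", "Male", "Male", "Male")
)

gender.proportions <- toy %>%
  count(state, gender) %>%              # one row per state/gender pair, tally in column n
  group_by(state) %>%
  mutate(proportion = n / sum(n)) %>%   # share of each gender within its state
  ungroup()
```

Here Idaho comes out as 3 Female and 1 Male, so the proportions are 0.75 and 0.25; Maine is the reverse.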

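Edit:

In response to the OP's comment/request, the code below repeats the female and male proportions of each state on every student row in that state: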

gender.proportions <- group_by(school.data, state) %>% 
  mutate(prop.female = mean(gender == 'Female'), prop.male = mean(gender == 'Male'))

   student.id  state gender prop.female prop.male
        <int> <fctr> <fctr>       <dbl>     <dbl>
1         479  Idaho   Male         0.8       0.2
2         634  Idaho Female         0.8       0.2
3         175  Idaho Female         0.8       0.2
4         527  Idaho Female         0.8       0.2
5         368  Idaho Female         0.8       0.2
6         423  Idaho   Male         0.8       0.2
7         357  Idaho Female         0.8       0.2
8         994  Idaho Female         0.8       0.2
9         479  Idaho Female         0.8       0.2
10        634  Idaho Female         0.8       0.2
# ... with 30 more rows

Answer 2 (score: 3)

Here is one solution using left_join.

state <- rep(c(rep("Idaho", 10), rep("Maine", 10)), 2)
student.id <- sample(1:1000,8,replace=T)
gender <- rep( c("Male","Female"), 100*c(0.25,0.75) )  
gender <- sample(gender, 40)
school.data <- data.frame(student.id, state, gender)

school.data %>%
    group_by(state) %>%
    mutate(gender_id = ifelse(gender == "Female", 1, 0)) %>%
    summarise(female_count = sum(gender_id)) %>%
    left_join(school.data %>%
                  group_by(state) %>%
                  summarise(state_count = n()),
              by = "state") %>%
    mutate(percent_female = female_count / state_count)
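
For comparison, the two per-state summaries can probably be computed in a single summarise() call, which avoids the join altogether. This is only a sketch on a hypothetical, deterministic toy data frame, not a replacement for the answer above; it keeps the same column names so the results are easy to compare:

```r
library(dplyr)

# Hypothetical stand-in for school.data with fixed values
toy <- data.frame(
  state  = rep(c("Idaho", "Maine"), each = 4),
  gender = c("Female", "Female", "Female", "Male",
             "Female", "Male", "Male", "Male")
)

state.summary <- toy %>%
  group_by(state) %>%
  summarise(
    female_count   = sum(gender == "Female"),    # number of female rows in the state
    state_count    = n(),                        # total rows in the state
    percent_female = female_count / state_count  # can refer to the columns created just above
  )
```

Note that percent_female is a proportion between 0 and 1, as in the original answer; multiply by 100 if an actual percentage is wanted.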