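How can I use dplyr to create, for each state, the proportion of one level of a factor variable? For example, I would like to add a variable to the data frame indicating the percentage of females in each state.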
# gen data
state <- rep(c(rep("Idaho", 10), rep("Maine", 10)), 2)
student.id <- sample(1:1000,8,replace=T)
gender <- rep( c("Male","Female"), 100*c(0.25,0.75) )
gender <- sample(gender, 40)
school.data <- data.frame(student.id, state, gender)
Below is an attempt that I know is wrong, but it does give me access to the information:
middle %>%
  group_by(state, gender %in% c("Female")) %>%
  summarise(count = n()) %>%
  mutate(test_count = count)
I am having a hard time with the count and mutate functions, which makes it difficult to get any further. They do not behave the way I expect.
Answer 0 (score: 8)
To add a new column to the existing data frame:
school.data %>%
  group_by(state) %>%
  mutate(pct.female = mean(gender == "Female"))
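The mean(gender == "Female") idiom works because the comparison yields a logical vector, and R treats TRUE as 1 and FALSE as 0 in arithmetic, so the mean is the share of matching rows. A minimal base R sketch of that step, using a hypothetical toy vector rather than the question's data:

```r
# Hypothetical toy vector, not taken from the question's data
gender <- c("Female", "Male", "Female", "Female")

# Comparison gives a logical vector: TRUE FALSE TRUE TRUE
is_female <- gender == "Female"

# TRUE counts as 1, so the sum is the number of females
n_female <- sum(is_female)      # 3

# ...and the mean is the proportion of females
prop_female <- mean(is_female)  # 0.75
```

Inside group_by(state), the same expression is simply evaluated once per state.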
If you want just one row per state instead of adding a column to the original data, use summarize instead of mutate.
school.data %>%
  group_by(state) %>%
  summarize(pct.female = mean(gender == "Female"))
# # A tibble: 2 x 2
# state pct.female
# <fctr> <dbl>
# 1 Idaho 0.75
# 2 Maine 0.70
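The exact proportions depend on the random draw made by sample() in the question's data-generation code, so a different run will generally print different numbers. A minimal sketch of one way to make the example reproducible, assuming a hypothetical wrapper make_school_data() and an arbitrary seed of 42 (neither is part of the original post):

```r
# Hypothetical helper: wraps the question's data generation behind a fixed seed
make_school_data <- function(seed) {
  set.seed(seed)  # fix the random number stream so sample() is repeatable
  state <- rep(c(rep("Idaho", 10), rep("Maine", 10)), 2)
  student.id <- sample(1:1000, 8, replace = TRUE)
  gender <- sample(rep(c("Male", "Female"), 100 * c(0.25, 0.75)), 40)
  data.frame(student.id, state, gender)
}

a <- make_school_data(42)
b <- make_school_data(42)

# Same seed, same data
same_data <- identical(a, b)  # TRUE
```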
Answer 1 (score: 7)
library(dplyr)
gender.proportions <- group_by(school.data, state, gender) %>%
  summarize(n = length(student.id)) %>%  # count per gender
  ungroup %>% group_by(state) %>%
  mutate(proportion = n / sum(n))        # proportion per gender
# state gender n proportion
# <fctr> <fctr> <int> <dbl>
#1 Idaho Female 16 0.80
#2 Idaho Male 4 0.20
#3 Maine Female 11 0.55
#4 Maine Male 9 0.45
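The same count-then-divide pattern can also be written with dplyr's count(), which does the grouping and tallying in one call. This is a sketch of an alternative, not the answer author's code, and it uses a hypothetical toy data frame with a known composition so the result can be checked by hand:

```r
library(dplyr)

# Hypothetical toy data with a known composition (not the question's random sample)
toy <- data.frame(
  state  = rep(c("Idaho", "Maine"), each = 4),
  gender = c("Female", "Female", "Female", "Male",
             "Female", "Male", "Male", "Male")
)

toy_props <- toy %>%
  count(state, gender) %>%             # one row per state/gender pair, tally in column n
  group_by(state) %>%
  mutate(proportion = n / sum(n)) %>%  # share of each gender within its state
  ungroup()
```

Here Idaho comes out as 0.75 female and 0.25 male, and Maine as 0.25 female and 0.75 male.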
In reference to the OP's comment/request, the code below repeats the female and male proportions for every student within each state:
gender.proportions <- group_by(school.data, state) %>%
  mutate(prop.female = mean(gender == 'Female'),
         prop.male = mean(gender == 'Male'))
student.id state gender prop.female prop.male
<int> <fctr> <fctr> <dbl> <dbl>
1 479 Idaho Male 0.8 0.2
2 634 Idaho Female 0.8 0.2
3 175 Idaho Female 0.8 0.2
4 527 Idaho Female 0.8 0.2
5 368 Idaho Female 0.8 0.2
6 423 Idaho Male 0.8 0.2
7 357 Idaho Female 0.8 0.2
8 994 Idaho Female 0.8 0.2
9 479 Idaho Female 0.8 0.2
10 634 Idaho Female 0.8 0.2
# ... with 30 more rows
Answer 2 (score: 3)
Here is one solution using left_join.
state <- rep(c(rep("Idaho", 10), rep("Maine", 10)), 2)
student.id <- sample(1:1000,8,replace=T)
gender <- rep( c("Male","Female"), 100*c(0.25,0.75) )
gender <- sample(gender, 40)
school.data <- data.frame(student.id, state, gender)
school.data %>%
  group_by(state) %>%
  mutate(gender_id = ifelse(gender == "Female", 1, 0)) %>%
  summarise(female_count = sum(gender_id)) %>%
  left_join(
    school.data %>%
      group_by(state) %>%
      summarise(state_count = n()),
    by = c("state" = "state")
  ) %>%
  mutate(percent_female = female_count / state_count)
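The join combines a per-state female count with a per-state row count. Since both are computed over the same grouping, they can also be produced in a single summarise() call, which avoids the join. A minimal sketch of that variant, assuming a hypothetical toy data frame with a known composition:

```r
library(dplyr)

# Hypothetical toy data with a known composition (not the question's random sample)
toy <- data.frame(
  state  = rep(c("Idaho", "Maine"), each = 4),
  gender = c("Female", "Female", "Female", "Male",
             "Female", "Male", "Male", "Male")
)

toy_summary <- toy %>%
  group_by(state) %>%
  summarise(
    female_count   = sum(gender == "Female"),    # females per state
    state_count    = n(),                        # rows per state
    percent_female = female_count / state_count  # later columns can use earlier ones
  )
```

With this toy data, Idaho has 3 females out of 4 rows (0.75) and Maine has 1 out of 4 (0.25), the same values the join-based pipeline would give for the same input.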