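I am analyzing a dataset that contains demographic information. These are the main variables and a subset of the rows I am working with: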
hh_id is_head_of_household married gender age
1 1 single male 28
1 0 single female 27
2 1 married male 33
2 0 married female 34
2 1 single male 6
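I need to create a variable that indicates the household type, using these four categories: "single male head of household", "single female head of household", "married couple", "unmarried couple".
For example, each household has a unique ID. The first household is an unmarried couple, because it has at least two adults (18 or older), at least one of them is the head of household (flagged 1 or 0), and both are listed as "single" in the married column. The second household is a married couple, because it has at least two adults, one of them is the head, and they are listed as "married" in the married column. A "single male" or "single female" household has at least one adult male or female who is also the head of household. Any other individuals in that household must be children (under 18).
I tried to use dplyr to create a column that indicates one of these four categories for each unique household ID.
First, I created an adult-or-child category: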
individual_data["adult"] <- NA
individual_data$adult <- ifelse(individual_data$age >= 18, "adult",
"child")
This is the code I have tried so far to create the household variable:
individual_data["if_adult"] <- ifelse(individual_data$age >= 18, "1","0")
library(dplyr)
individual_data %>%
group_by(hh_id) %>%
mutate(unmarried_couple = sum(if_adult*(married =="Single"))==1,
total_adults = sum(if_adult))
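This code does not produce the expected result, and I am not sure how to derive the remaining categories. Ideally, my new dataset should look like this: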
hh_id is_head_of_household married gender age type
1 1 single male 28 unmarried couple
1 0 single female 27 unmarried couple
2 1 married male 33 married couple
2 0 married female 34 married couple
2 1 single male 6 married couple
..
n ----------------------------------------------------------
Each hh_id can have only one classification. How can I do this in dplyr?
Data structure:
structure(list(hh_id = c(1L, 1L, 2L, 2L, 2L, 3L, 3L, 4L, 4L,
5L), person_id = 1:10, is_head_of_household = c(1L, 0L, 1L, 0L,
0L, 1L, 0L, 1L, 0L, 1L), married = structure(c(2L, 2L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Married", "Single"), class = "factor"),
gender = structure(c(2L, 5L, 2L, 5L, 5L, 2L, 5L, 2L, 3L,
2L), .Label = c("F", "Female", "FEMALE", "M", "Male", "MALE"
), class = "factor"), race = structure(c(3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L), .Label = c("Asian", "Black", "White"
), class = "factor"), age = c(28L, 27L, 34L, 33L, 6L, 28L,
29L, 30L, 3L, 30L), voted_in_2012 = c(0L, 1L, 0L, 1L, 0L,
0L, 1L, 0L, 0L, 1L), is_college_graduate = c(1L, 1L, 1L,
0L, 1L, 1L, 0L, 1L, 0L, 1L), adult = c("adult", "adult",
"adult", "adult", "child", "adult", "adult", "adult", "child",
"adult")), row.names = c(NA, 10L), class = "data.frame")
Answer 0 (score: 0)
Edit: converted married to lowercase inside the case_when, to catch cases where that variable's capitalization differs from the sample data.
library(dplyr)
hh_types <- individual_data %>%
filter(age >= 18) %>% # only concerned with adults for categorization
arrange(hh_id, -is_head_of_household) %>% # bring head of hh to top
group_by(hh_id) %>% # For each hh_id...
mutate(adult_count = n()) %>% # ... how many adults
slice(1) %>% # just keep the top row (the head)
ungroup() %>%
mutate(category = case_when(
tolower(married) == "married" & adult_count > 1 ~ "married couple",
tolower(married) == "single" & adult_count > 1 ~ "unmarried couple",
adult_count == 1 ~ paste("single", gender, "head of household"),
TRUE ~ "Other")) %>%
select(hh_id, category)
individual_data %>%
left_join(hh_types)
#Joining, by = "hh_id"
# hh_id is_head_of_household married gender age category
#1 1 1 single male 28 unmarried couple
#2 1 0 single female 27 unmarried couple
#3 2 1 married male 33 married couple
#4 2 0 married female 34 married couple
#5 2 1 single male 6 married couple
#6 3 1 single female 30 single female head of household
#7 4 1 single male 28 single male head of household
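As a variation on the approach above, the same categories can be assigned in one grouped mutate, without the separate summary table and join. This is only a sketch under stated assumptions: it rebuilds the sample rows from this answer, and the helper columns adult_count, head_married and head_gender are names introduced here for illustration. Unlike the code above, it labels a household "Other" when no adult is flagged as head, rather than falling back to the first adult.

```r
library(dplyr)

# Same rows as the sample data used in this answer
individual_data <- data.frame(
  hh_id                = c(1L, 1L, 2L, 2L, 2L, 3L, 4L),
  is_head_of_household = c(1L, 0L, 1L, 0L, 1L, 1L, 1L),
  married = c("single", "single", "married", "married",
              "single", "single", "single"),
  gender  = c("male", "female", "male", "female", "male", "female", "male"),
  age     = c(28L, 27L, 33L, 34L, 6L, 30L, 28L),
  stringsAsFactors = FALSE
)

typed <- individual_data %>%
  group_by(hh_id) %>%
  mutate(
    # number of adults in the household
    adult_count = sum(age >= 18),
    # status and gender of the adult head (first match if several; NA if none)
    head_married = tolower(as.character(married))[
      is_head_of_household == 1 & age >= 18][1],
    head_gender = tolower(as.character(gender))[
      is_head_of_household == 1 & age >= 18][1],
    category = case_when(
      is.na(head_married)                         ~ "Other",
      adult_count > 1 & head_married == "married" ~ "married couple",
      adult_count > 1 & head_married == "single"  ~ "unmarried couple",
      adult_count == 1 ~ paste("single", head_gender, "head of household"),
      TRUE                                        ~ "Other"
    )
  ) %>%
  ungroup() %>%
  select(-adult_count, -head_married, -head_gender)
```

Because every row in a household receives the same category, each hh_id ends up with exactly one classification, and the child rows keep their household's label.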
Sample data, with single-adult households added:
individual_data <- read.table(
header = T,
stringsAsFactors = F,
colClasses = c("integer", "integer", "character", "character", "integer"),
text = "hh_id is_head_of_household married gender age
1 1 single male 28
1 0 single female 27
2 1 married male 33
2 0 married female 34
2 1 single male 6
3 1 single female 30
4 1 single male 28"
)
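One caveat when moving from this sample to the data in the question: the gender factor there lists levels with several spellings (F, Female, FEMALE, M, Male, MALE), so pasting gender directly into the label could yield inconsistent category names. A minimal sketch of a cleanup step follows; normalize_gender is a hypothetical helper introduced here, not part of dplyr, and the mapping is an assumption about what those levels mean.

```r
library(dplyr)

# Hypothetical helper: collapse the gender spellings seen in the question's
# factor levels into "male" / "female"; anything else becomes NA.
normalize_gender <- function(x) {
  x <- tolower(trimws(as.character(x)))
  case_when(
    x %in% c("m", "male")   ~ "male",
    x %in% c("f", "female") ~ "female",
    TRUE                    ~ NA_character_
  )
}

cleaned <- normalize_gender(factor(c("Male", "FEMALE", "F", "M", "unknown")))
```

It could be applied before categorizing, for example with mutate(gender = normalize_gender(gender)).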