Question

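I am analyzing a dataset with specific demographic information. These are the main variables and a subset of the data I am working with: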

hh_id   is_head_of_household    married   gender   age
1          1                    single    male     28
1          0                    single    female   27
2          1                    married   male     33
2          0                    married   female   34
2          1                    single    male     6

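I need to create a variable that indicates the household type, using these four categories: "single male head of household", "single female head of household", "married couple", "unmarried couple".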

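For example, each household has a unique ID. The first household is an unmarried couple because it has at least two adults (18 or older), at least one of whom is the head of household (the 1/0 flag), and both are listed as "single" in the married column. The second household is a married couple because it has at least two adults, one of whom is the head, and they are listed as "married" in the married column. A "single male" or "single female" household has at least one adult male or female who is also the head of household. Any other individuals in the household must be children (under 18).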

I tried to create a column with dplyr that assigns one of these four categories to each unique household ID:

First, I created an adult/child category:

individual_data["adult"] <- NA
individual_data$adult <- ifelse(individual_data$age >= 18, "adult", 
"child")

This is the code I have tried so far to create the variable for single households:

individual_data["if_adult"] <- ifelse(individual_data$age >= 18, "1","0")
library(dplyr)
individual_data %>% 
group_by(hh_id) %>% 
mutate(unmarried_couple = sum(if_adult*(married =="Single"))==1,
total_adults = sum(if_adult))

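This code does not produce the expected result, and I am not sure how to produce the other two categories. Ideally, my new dataset would look like this: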

   hh_id   is_head_of_household    married   gender   age     type
   1          1                    single    male     28  unmarried couple
   1          0                    single    female   27  unmarried couple
   2          1                    married   male     33    married couple
   2          0                    married   female   34    married couple
   2          1                    single    male     6     married couple
   ..
   n          ----------------------------------------------------------

Each hh_id should have exactly one classification. How can I do this with dplyr?

Data structure:

structure(list(hh_id = c(1L, 1L, 2L, 2L, 2L, 3L, 3L, 4L, 4L, 
5L), person_id = 1:10, is_head_of_household = c(1L, 0L, 1L, 0L, 
0L, 1L, 0L, 1L, 0L, 1L), married = structure(c(2L, 2L, 1L, 1L, 
2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Married", "Single"), class = "factor"), 
gender = structure(c(2L, 5L, 2L, 5L, 5L, 2L, 5L, 2L, 3L, 
2L), .Label = c("F", "Female", "FEMALE", "M", "Male", "MALE"
), class = "factor"), race = structure(c(3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L), .Label = c("Asian", "Black", "White"
), class = "factor"), age = c(28L, 27L, 34L, 33L, 6L, 28L, 
29L, 30L, 3L, 30L), voted_in_2012 = c(0L, 1L, 0L, 1L, 0L, 
0L, 1L, 0L, 0L, 1L), is_college_graduate = c(1L, 1L, 1L, 
0L, 1L, 1L, 0L, 1L, 0L, 1L), adult = c("adult", "adult", 
"adult", "adult", "child", "adult", "adult", "adult", "child", 
"adult")), row.names = c(NA, 10L), class = "data.frame")

Answer 1

Edit: wrapped married in tolower() inside the case_when, to catch cases where the variable's capitalization differs from the sample data.

library(dplyr)
hh_types <- individual_data %>%
  filter(age >= 18) %>%  # only concerned with adults for categorization
  arrange(hh_id, -is_head_of_household) %>%   # bring head of hh to top
  group_by(hh_id) %>%              # For each hh_id...
  mutate(adult_count = n()) %>%    # ... how many adults
  slice(1) %>%                     # just keep the top row  (the head)
  ungroup() %>%

  mutate(category = case_when(
    tolower(married) == "married"   & adult_count > 1 ~ "married couple",
    tolower(married) == "single" & adult_count > 1 ~ "unmarried couple",
    adult_count == 1   ~ paste("single", tolower(gender), "head of household"),
    TRUE   ~  "Other")) %>%
  select(hh_id, category)


individual_data %>%
  left_join(hh_types)
#Joining, by = "hh_id"
#  hh_id is_head_of_household married gender age                        category
#1     1                    1  single   male  28                unmarried couple
#2     1                    0  single female  27                unmarried couple
#3     2                    1 married   male  33                  married couple
#4     2                    0 married female  34                  married couple
#5     2                    1  single   male   6                  married couple
#6     3                    1  single female  30 single female head of household
#7     4                    1  single   male  28   single male head of household
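As a variation on the approach above, the classification can also be done inside a single grouped mutate, which avoids the separate lookup table and the join. This is only a sketch under some assumptions: the toy data frame is made up to mirror the question's columns, gender values are assumed to start with "f" or "m" in any capitalization (as in the question's dput output), and the helper columns (is_adult, status, sex, head_status, head_sex) are names introduced here for illustration.

```r
library(dplyr)

# Hypothetical toy data mirroring the question's columns,
# with the mixed capitalization seen in the dput output
toy <- data.frame(
  hh_id = c(1L, 1L, 2L, 2L, 2L, 3L, 4L),
  is_head_of_household = c(1L, 0L, 1L, 0L, 0L, 1L, 1L),
  married = c("Single", "Single", "Married", "Married", "Single", "Single", "single"),
  gender = c("Male", "Female", "male", "female", "Male", "FEMALE", "M"),
  age = c(28L, 27L, 33L, 34L, 6L, 30L, 28L),
  stringsAsFactors = FALSE
)

typed <- toy %>%
  mutate(
    is_adult = age >= 18,
    status = tolower(as.character(married)),
    # Assumption: the first letter is enough to tell the genders apart
    sex = ifelse(substr(tolower(as.character(gender)), 1, 1) == "f",
                 "female", "male")
  ) %>%
  group_by(hh_id) %>%
  mutate(
    adult_count = sum(is_adult),
    # Status and sex of the adult head; NA if the household has none
    head_status = status[is_adult & is_head_of_household == 1][1],
    head_sex = sex[is_adult & is_head_of_household == 1][1],
    type = case_when(
      is.na(head_status) ~ "other",
      adult_count > 1 & head_status == "married" ~ "married couple",
      adult_count > 1 & head_status == "single" ~ "unmarried couple",
      adult_count == 1 ~ paste("single", head_sex, "head of household"),
      TRUE ~ "other"
    )
  ) %>%
  ungroup() %>%
  select(-is_adult, -status, -sex, -adult_count, -head_status, -head_sex)
```

Because the type is computed per group and recycled to every row, children receive the same label as the adults in their household. Households without an adult head fall into "other" rather than getting a misleading label.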

Sample data, with single-adult households added:

individual_data <- read.table(
  header = TRUE,
  stringsAsFactors = FALSE, 
  colClasses = c("integer", "integer", "character", "character", "integer"),
  text = "hh_id   is_head_of_household    married   gender   age
1          1                    single    male     28
1          0                    single    female   27
2          1                    married   male     33
2          0                    married   female   34
2          1                    single    male     6
3          1                    single    female   30
4          1                    single    male     28"
)
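For reference, the attempt in the question likely fails for three reasons: if_adult is built as the character strings "1" and "0", so multiplying it raises a "non-numeric argument to binary operator" error; the comparison with "Single" is case-sensitive, while the printed table uses "single"; and testing the sum against 1 would flag households with exactly one single adult rather than a couple. A minimal corrected sketch of that flag follows. It assumes that "unmarried couple" means at least two adults, all of them listed as single, and it uses a small made-up data frame plus a hypothetical single_adults helper column.

```r
library(dplyr)

# Hypothetical toy data: household 1 is an unmarried couple,
# household 2 is a married couple with a child
toy <- data.frame(
  hh_id = c(1L, 1L, 2L, 2L, 2L),
  married = c("single", "single", "married", "married", "single"),
  age = c(28L, 27L, 33L, 34L, 6L),
  stringsAsFactors = FALSE
)

flags <- toy %>%
  mutate(if_adult = as.integer(age >= 18)) %>%   # numeric flag, not "1"/"0"
  group_by(hh_id) %>%
  mutate(
    total_adults = sum(if_adult),
    # Case-insensitive comparison; children contribute 0 to the sum
    single_adults = sum(if_adult * (tolower(married) == "single")),
    unmarried_couple = total_adults >= 2 & single_adults == total_adults
  ) %>%
  ungroup()
```

This only reproduces one of the four categories; the case_when in the answer covers all of them in a single step.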
