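I have data on married couples with the variables 'household_number', 'head_of_household', 'education', and 'income'. 'household_number' is a unique ID assigned to each household. 'head_of_household' indicates whether the person is the head of the household (1 = head of household, 2 = spouse of the head), and 'education' and 'income' are the person's education level and personal income. For example, the data look like this.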
'household_number' 'head_of_household' 'education' 'income'
1 1 high 1000
1 2 low 100
3 1 medium 500
3 2 high 800
4 2 high 800
4 1 high 800
9 1 low 150
9 2 low 200
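I want to create spouse variables for each person, so that the data look like the following. 'spouse_edu' is the spouse's education level and 'spouse_inc' is the spouse's income.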
'household_number' 'head_of_household' 'education' 'income' 'spouse_edu' 'spouse_inc'
1 1 high 1000 low 100
1 2 low 100 high 1000
3 1 medium 500 high 800
3 2 high 800 medium 500
4 2 high 800 high 800
4 1 high 800 high 800
9 1 low 150 low 200
9 2 low 200 low 150
My dataset is very large, so I am looking for a simple way to do this. Is there an elegant approach?
Below is a reproducible example.
household_number <- c(1,1,3,3,4,4,9,9)
head_of_household <- c(1,2,1,2,2,1,1,2)
education <- c("high", "low", "medium", "high", "high", "high", "low", "low")
income <- c(1000, 100, 500, 800, 800, 800, 150, 200)
data <- data.frame(household_number, head_of_household, education, income)
Answer 0: (score: 7)
You can use base::rev together with dplyr here.
library(dplyr)
data %>%
group_by(household_number) %>%
mutate(spouse_income = rev(income),
spouse_education = rev(education)) %>%
ungroup()
# A tibble: 8 x 6
# household_number head_of_household education income spouse_income spouse_education
# <dbl> <dbl> <fctr> <dbl> <dbl> <fctr>
#1 1 1 high 1000 100 low
#2 1 2 low 100 1000 high
#3 3 1 medium 500 800 high
#4 3 2 high 800 500 medium
#5 4 2 high 800 800 high
#6 4 1 high 800 800 high
#7 9 1 low 150 200 low
#8 9 2 low 200 150 low
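Reversing within each group swaps the two partners only when every household has exactly two rows, one head and one spouse. The question's example satisfies this, but on a large dataset it may be worth checking first. Below is a minimal base R sketch of such a check; the names `couples`, `rows_per_household`, `bad_households`, and `roles_ok` are illustrative and do not come from the answer.

```r
# Example data from the question (only the columns needed for the check)
couples <- data.frame(
  household_number  = c(1, 1, 3, 3, 4, 4, 9, 9),
  head_of_household = c(1, 2, 1, 2, 2, 1, 1, 2)
)

# Number of rows per household
rows_per_household <- table(couples$household_number)

# Households that do not have exactly two rows (empty if all is well)
bad_households <- names(rows_per_household)[rows_per_household != 2]

# Each household should contain exactly the roles 1 (head) and 2 (spouse)
roles_ok <- tapply(couples$head_of_household, couples$household_number,
                   function(x) setequal(x, c(1, 2)))

length(bad_households) == 0 && all(roles_ok)
```

If the last expression is not TRUE, households with one row, or with more than two, would get wrong or misaligned spouse values from rev.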
A solution using data.table.
library(data.table)
setDT(data)[, c("spouse_income", "spouse_education") := .(rev(income), rev(education)),
by = household_number][]
# same as
# setDT(data)[, `:=`(spouse_income = rev(income),
# spouse_education = rev(education)),
# by = household_number][]
In base R you could do
transform(data,
spouse_income = ave(income, household_number, FUN = rev),
spouse_education = ave(education, household_number, FUN = rev))
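The rev-based solutions above depend on each household having exactly two rows. As an alternative that matches partners by key instead, here is a sketch of a self-join in base R. It is not part of the answer: the names `couples`, `partner`, and `result` are illustrative, and it assumes head_of_household only takes the values 1 and 2, so that 3 - head_of_household gives the partner's role.

```r
# Example data from the question
couples <- data.frame(
  household_number  = c(1, 1, 3, 3, 4, 4, 9, 9),
  head_of_household = c(1, 2, 1, 2, 2, 1, 1, 2),
  education = c("high", "low", "medium", "high", "high", "high", "low", "low"),
  income    = c(1000, 100, 500, 800, 800, 800, 150, 200)
)

# Lookup table: every person relabelled with the role of their partner
# (1 becomes 2 and 2 becomes 1), carrying their own values as spouse values
partner <- data.frame(
  household_number  = couples$household_number,
  head_of_household = 3 - couples$head_of_household,
  spouse_edu        = couples$education,
  spouse_inc        = couples$income
)

# Join each person to the row that describes their spouse
result <- merge(couples, partner,
                by = c("household_number", "head_of_household"))
```

Note that merge may return the rows in a different order from the input, and that by default a person without a matching partner row is dropped from the result (all.x = TRUE would keep them with NA spouse values).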
Answer 1: (score: 1)
Another way to solve this uses shift from data.table. It is a two-step process.
First, group by household_number and apply shift with its default type, "lag".
library(data.table)
setDT(data)  # convert to a data.table by reference, if it is not one already
data[,':='(
spouse_edu = shift(education),
spouse_inc = shift(income)),
by = household_number]
> data
household_number head_of_household education income spouse_edu spouse_inc
1: 1 1 high 1000 NA NA
2: 1 2 low 100 high 1000
3: 3 1 medium 500 NA NA
4: 3 2 high 800 medium 500
5: 4 2 high 800 NA NA
6: 4 1 high 800 high 800
7: 9 1 low 150 NA NA
8: 9 2 low 200 low 150
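Now use shift with type = "lead" to fill in the spouse details for the remaining rows, taking care not to overwrite the spouse details that were already filled in.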
data[,':='(
spouse_edu = ifelse( is.na(spouse_edu), shift(education, type="lead"), spouse_edu) ,
spouse_inc = ifelse( is.na(spouse_inc), shift(income, type="lead"), spouse_inc)),
by = household_number]
> data
household_number head_of_household education income spouse_edu spouse_inc
1: 1 1 high 1000 low 100
2: 1 2 low 100 high 1000
3: 3 1 medium 500 high 800
4: 3 2 high 800 medium 500
5: 4 2 high 800 high 800
6: 4 1 high 800 high 800
7: 9 1 low 150 low 200
8: 9 2 low 200 low 150
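The two steps above can also be sketched as a single assignment with data.table's fcoalesce, which returns the first non-missing value across its arguments. This assumes a data.table version that provides fcoalesce (1.12.4 or later) and, like the original, exactly two rows per household; the object name `dt` is illustrative.

```r
library(data.table)

# Example data from the question
dt <- data.table(
  household_number  = c(1, 1, 3, 3, 4, 4, 9, 9),
  head_of_household = c(1, 2, 1, 2, 2, 1, 1, 2),
  education = c("high", "low", "medium", "high", "high", "high", "low", "low"),
  income    = c(1000, 100, 500, 800, 800, 800, 150, 200)
)

# Within each household, take the lagged value where it exists
# and fall back to the lead value otherwise
dt[, `:=`(
  spouse_edu = fcoalesce(shift(education), shift(education, type = "lead")),
  spouse_inc = fcoalesce(shift(income),    shift(income,    type = "lead"))
), by = household_number]
```

As with the two-step version, a spouse whose own value is missing will still produce NA in the partner's row.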