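I have a dataset of households and their members in wide format. The number of members is fixed, and there is one column per member for each variable. For simplicity, assume each household has 2 members and that two questions are asked: age (Q1) and gender (Q2).
The file looks like this: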
HHID MEM_ID_1 MEM_ID_2 AGE_1 AGE_2 GENDER_1 GENDER_2
1 1 2 50 45 M F
I want to convert it to the following format:
HHID MEM_ID AGE GENDER
1 1 50 M
1 2 45 F
Answer 0: (score: 0)
Let's say our data frame is called test:
dput(test)
structure(list(HHID = 1L, MEM_ID_1 = 1L, MEM_ID_2 = 2L, AGE_1 = 50L,
AGE_2 = 45L, GENDER_1 = structure(1L, .Label = "Male", class = "factor"),
GENDER_2 = structure(1L, .Label = "Female", class = "factor")), class = "data.frame", row.names = c(NA,
-1L))
You can try the following reshape() call on this data frame:
reshape(test, direction = "long",
        varying = list(c("MEM_ID_1", "MEM_ID_2"), c("AGE_1", "AGE_2"), c("GENDER_1", "GENDER_2")),
        v.names = c("MEM_ID", "AGE", "GENDER"),
        idvar = "HHID")
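The reshape() function comes from base R. Broadly speaking, it can melt several groups of variables at once when you supply the varying argument and set direction to "long". The v.names argument names the resulting long columns, and idvar identifies the household.
For example, in your case, we pass a list of three vectors of variable names to the varying argument: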
varying = list(c("MEM_ID_1","MEM_ID_2"), c("AGE_1","AGE_2"), c( "GENDER_1","GENDER_2"))
The output is as follows:
    HHID time MEM_ID AGE GENDER
1.1    1    1      1  50   Male
1.2    1    2      2  45 Female
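The result above still carries a time column and compound row names that the question's target format does not have. A minimal sketch of the cleanup, assuming the gender values are stored as plain "M"/"F" strings as in the question (the object name long is only for illustration):

```r
# Rebuild the one-household example from the question with character genders.
test <- data.frame(
  HHID = 1L,
  MEM_ID_1 = 1L, MEM_ID_2 = 2L,
  AGE_1 = 50L, AGE_2 = 45L,
  GENDER_1 = "M", GENDER_2 = "F",
  stringsAsFactors = FALSE
)

# Same reshape() call as in the answer.
long <- reshape(test, direction = "long",
                varying = list(c("MEM_ID_1", "MEM_ID_2"),
                               c("AGE_1", "AGE_2"),
                               c("GENDER_1", "GENDER_2")),
                v.names = c("MEM_ID", "AGE", "GENDER"),
                idvar = "HHID")

# Drop the helper column and the "1.1"/"1.2" row names added by reshape().
long$time <- NULL
rownames(long) <- NULL
long
```

After this step the columns are HHID, MEM_ID, AGE and GENDER, with one row per household member.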
Answer 1: (score: 0)
You can use tidyr::gather(), tidyr::separate(), and tidyr::spread() in sequence. Here household is the name of your data frame.
library(tidyverse)
gather
First, apply tidyr::gather(). This gives the following result.
household %>%
  gather(-HHID, key = domestic, value = value)
#>   HHID domestic value
#> 1    1 MEM_ID_1     1
#> 2    1 MEM_ID_2     2
#> 3    1    AGE_1    50
#> 4    1    AGE_2    45
#> 5    1 GENDER_1     M
#> 6    1 GENDER_2     F
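Now all you have to do is split the domestic column at the underscore that precedes a digit (_[0-9]). Written as a regular expression with a lookahead, so that the digit itself is not consumed, this is _(?=[0-9]).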
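To see what the lookahead does in isolation, here is a small base R check. It is only an illustration and uses strsplit() with perl = TRUE rather than separate(); the variable names cols and parts are hypothetical:

```r
# Column names as they appear after gather().
cols <- c("MEM_ID_1", "AGE_2")

# Split only at an underscore that is followed by a digit.
# The lookahead (?=[0-9]) checks for the digit without removing it,
# so the underscore inside "MEM_ID" is left alone.
parts <- strsplit(cols, "_(?=[0-9])", perl = TRUE)
parts
```

Each name splits into exactly two pieces, the variable name and the member number, which is what separate() needs for its two into columns. The full pipeline applies the same pattern: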
household %>%
  gather(-HHID, key = domestic, value = value) %>% # long data
  separate(domestic, into = c("domestic", "vals"), sep = "_(?=[0-9])") %>% # separate the digit
  spread(domestic, value) %>% # wide format
  select(HHID, MEM_ID, AGE, GENDER, -vals) # just arranging columns, and excluding needless column
#>   HHID MEM_ID AGE GENDER
#> 1    1      1  50      M
#> 2    1      2  45      F
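In current tidyr, gather() and spread() are superseded by pivot_longer() and pivot_wider(). As an alternative to the pipeline above, not part of the original answer, here is a sketch that does the reshaping in one step. It assumes tidyr 1.0.0 or later, and the column name member and the object name long are hypothetical:

```r
library(tidyr)

# The one-household example from the question.
household <- data.frame(
  HHID = 1L,
  MEM_ID_1 = 1L, MEM_ID_2 = 2L,
  AGE_1 = 50L, AGE_2 = 45L,
  GENDER_1 = "M", GENDER_2 = "F",
  stringsAsFactors = FALSE
)

# ".value" tells pivot_longer() that the first captured group
# (MEM_ID, AGE, GENDER) names the output columns, while the second
# group (the trailing digits) goes into the helper column "member".
long <- pivot_longer(
  household,
  cols = -HHID,
  names_to = c(".value", "member"),
  names_pattern = "(.*)_([0-9]+)$"
)

# The member index duplicates MEM_ID here, so drop it.
long$member <- NULL
long
```

Unlike the gather() route, this keeps each column's original type, because the values are never stacked into a single character column.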