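I would like to do the following with the example dataset in R below (the actual dataset is longer, with address columns for several more years):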
|ID|birthyr |address1990|address1991|address1992|address1993|
|A |1992 |NA |NA |2 |2 |
|B |1990 |2 |2 |3 |3 |
|C |1991 |NA |3 |3 |1 |
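I want to create a new column containing each person's address value for their birth year. Ideally, I would look at the year in each person's birthyr, find the column whose header contains that string, and then use the value in that column. I currently have a way to do this (see the code below), but it is not the best approach because it takes the value from the first address column that contains data, and I am worried that this could lead to data loss.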
#dat is the dataset
#add empty columns that new values will go in
dat$birth_address<-NA
dat$address_first_year<-NA
#Take first value from address column which contains data and add the value to birth address and then add the column name to the column address_first_year
J<-seq(3,6,by=1)
for(i in 1:dim(dat)[1]){
  for(j in J){
    if(!is.na(dat[i,j])){
      dat$birth_address[i]<-dat[i,j]
      dat$address_first_year[i]<-names(dat)[j]
      break
    }
  }
}
#remove string from address_first_year column and change years to numeric
dat$address_first_year<-sub("address", "", dat$address_first_year)
dat$address_first_year<-as.numeric(dat$address_first_year)
#remove rows where address_first_year is not equal to birthyr to ensure that values in new column are actually from birthyr
for(i in 1:dim(dat)[1]){
  if(dat$address_first_year[i] != dat$birthyr[i]){
    dat$birth_address[i]<-NA
  }
}
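When I run the code above on the example, I get the following result. Although this gives me what I want here, I think there are cases where it would not, so I would like a cleaner, more robust way to do this.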
ID birthyr address1990 address1991 address1992 address1993 birth_address address_first_year
1 A 1992 NA NA 2 2 2 1992
2 B 1990 2 2 3 3 2 1990
3 C 1991 NA 3 3 1 3 1991
Edit: updated following the comments below. These are the results I get with the code below, but they do not seem to be what I expected.
ID birthyr address1990 address1991 address1992 address1993 birth_address
1 A 1992 NA NA 2 2 2
2 B 1990 2 2 3 3 3
3 C 1991 NA 3 3 1 2
Thanks

Answer 0 (score: 2)
Given that dat is your data, and using dplyr and tidyr:
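library(dplyr)
library(tidyr)

dat %>%
  # reshape to long format: one row per ID and address column (columns 3 to 6)
  gather(addressYY, value, 3:6) %>%
  # strip the "address" prefix so that only the year remains
  mutate(BirthAddress = gsub(x = addressYY, 'address', '')) %>%
  # keep only the row whose address year equals the birth year
  filter(birthyr == BirthAddress)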
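
For comparison, the lookup described in the question (find the column whose header contains the birth year, then take that row's value) can also be sketched in base R with matrix indexing, without reshaping. This is only a minimal sketch: it assumes the address columns are named "address" followed by a four-digit year, as in the example, and the names addr_cols, col_idx and addr_mat are introduced here purely for illustration.

```r
# Example data, copied from the table in the question
dat <- data.frame(
  ID = c("A", "B", "C"),
  birthyr = c(1992, 1990, 1991),
  address1990 = c(NA, 2, NA),
  address1991 = c(NA, 2, 3),
  address1992 = c(2, 3, 3),
  address1993 = c(2, 3, 1)
)

# Select the address columns by name rather than by position
# (assumption: they are all named "address" + four-digit year)
addr_cols <- grep("^address[0-9]{4}$", names(dat), value = TRUE)

# For each row, locate the address column that matches the birth year;
# match() gives NA when there is no column for that year
col_idx <- match(paste0("address", dat$birthyr), addr_cols)

# Matrix indexing: for row i, pick the element in column col_idx[i];
# an NA in col_idx produces an NA in the result
addr_mat <- as.matrix(dat[addr_cols])
dat$birth_address <- addr_mat[cbind(seq_len(nrow(dat)), col_idx)]

dat
```

Because the column is looked up by name for every row, a person whose birth year has no matching address column simply gets NA in birth_address instead of a value from some other year.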