Extracting data from a variable based on the column header in R

Time: 2017-02-07 20:19:38

Tags: r data-processing

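I would like to be able to do the following with the example dataset below in R (the real dataset is longer, with several more years of addresses):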

|ID|birthyr   |address1990|address1991|address1992|address1993|
|A |1992      |NA         |NA         |2          |2          |
|B |1990      |2          |2          |3          |3          |
|C |1991      |NA         |3          |3          |1          |

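I want to create a new column containing each person's address value for the year they were born. Ideally, I would look up the year in each person's birthyr, find the column header that contains that string, and take the value from that column. I currently have a way of doing this (see the code below), but it is not the best approach: it takes the value from the first address column that contains data, and I worry this could lead to data loss.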

#dat is the dataset

#add empty columns that new values will go in
dat$birth_address<-NA
dat$address_first_year<-NA

#Take first value from address column which contains data and add the value to  birth address and then add the column name to the column address_first_year
J<-seq(3,6,by=1)
for(i in 1:dim(dat)[1]){
    for(j in J){
        if(!is.na(dat[i,j])){
            dat$birth_address[i]<-dat[i,j]
            dat$address_first_year[i]<-names(dat)[j]
            break
        }
    }
}

#remove string from address_first_year column and change years to numeric
dat$address_first_year<-sub("address", "", dat$address_first_year)
dat$address_first_year<-as.numeric(dat$address_first_year)

#remove rows where address_first_year is not equal to birthyr to ensure that values in new column are actually from birthyr
for(i in 1:dim(dat)[1]){
    if(dat$address_first_year[i] != dat$birthyr[i]){
        dat$birth_address[i]<-NA
    }
}

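Running the code above on the example gives the result below. Although this gives me what I want here, I think there are cases where it would not, so I would like a more concise and more robust way of doing this.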

  ID birthyr address1990 address1991 address1992 address1993 birth_address address_first_year
1  A    1992          NA          NA           2           2             2               1992
2  B    1990           2           2           3           3             2               1990
3  C    1991          NA           3           3           1             3               1991

Edit: updated based on the comments below. These are the results I get with the code below, but they are not what I expected.

  ID birthyr address1990 address1991 address1992 address1993 birth_address
1  A    1992          NA          NA           2           2             2
2  B    1990           2           2           3           3             3
3  C    1991          NA           3           3           1             2

Thanks

1 Answer:

Answer 0: (score: 2)

Given that dat is your data, and using dplyr and tidyr:

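library(dplyr)
library(tidyr)
dat %>%
  # reshape the address columns (3 to 6) to long format: one row per ID and year
  gather(addressYY, value, 3:6) %>%
  # strip the "address" prefix, leaving the year as a string
  mutate(BirthAdderess = gsub(x = addressYY, 'address', '')) %>%
  # keep only the rows whose address year matches the birth year
  filter(birthyr == BirthAdderess)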
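
As an alternative that keeps the original row order, here is a minimal base R sketch. It assumes every address column is named address followed by a four-digit year. The names addr_cols, addr, and col_idx are introduced here for illustration and do not come from the question. It builds the expected column name for each birth year and uses matrix indexing to pick one cell per row. Note that the gather() pipeline above returns rows ordered by address year (B, C, A for the sample data), so copying its value column back onto dat by position would misalign the IDs, which may be what the edit in the question is showing.

```r
# Sample data reconstructed from the table in the question
dat <- data.frame(
  ID          = c("A", "B", "C"),
  birthyr     = c(1992, 1990, 1991),
  address1990 = c(NA, 2, NA),
  address1991 = c(NA, 2, 3),
  address1992 = c(2, 3, 3),
  address1993 = c(2, 3, 1),
  stringsAsFactors = FALSE
)

# Assumption: address columns are named "address" + four-digit year
addr_cols <- grep("^address[0-9]{4}$", names(dat), value = TRUE)
addr <- as.matrix(dat[addr_cols])

# Position of each person's birth-year column among the address columns;
# NA when the birth year has no matching column
col_idx <- match(paste0("address", dat$birthyr), addr_cols)

# Matrix indexing picks one (row, column) cell per person;
# an NA column index yields NA in the result
dat$birth_address <- addr[cbind(seq_len(nrow(dat)), col_idx)]

print(dat[c("ID", "birthyr", "birth_address")])
```

For the sample data this should give birth_address values of 2, 2 and 3 for A, B and C, matching the expected result in the question. Because the lookup is by column name rather than by the first non-missing column, a person whose birth-year address is NA gets NA instead of a value from a later year.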