Question

I want to impute age using year information. I have a dataset with the following structure:

library(data.table)

dat <- data.table(id = c(rep(1, 8), rep(2, 8)),
                  year = c(2007:2014, 2007:2014),
                  age = c(1, NA, 3, NA, NA, 5, 7, NA, NA, NA, 30, NA, 32, 35, NA, NA),
                  age_imp = c(1, 2, 3, 4, 5, 5, 7, 8, 28, 29, 30, 31, 32, 35, 36, 37)
)


    id year age age_imp
 1:  1 2007   1       1
 2:  1 2008  NA       2
 3:  1 2009   3       3
 4:  1 2010  NA       4
 5:  1 2011  NA       5
 6:  1 2012   5       5
 7:  1 2013   7       7
 8:  1 2014  NA       8
 9:  2 2007  NA      28
10:  2 2008  NA      29
11:  2 2009  30      30
12:  2 2010  NA      31
13:  2 2011  32      32
14:  2 2012  35      35
15:  2 2013  NA      36
16:  2 2014  NA      37

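The original variable age does not always advance in step with the calendar year (e.g., an interview held less than a year after the previous one, measurement error, etc.), so I want to keep the observed values as they are. For the NA rows, I want to continue a yearly sequence from the observed values (as in age_imp).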

Any suggestions on how to do this?

Answer 1

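You can first use the first non-NA age to form a linear equation, then linearly interpolate and extrapolate within each id, ignoring jumps for now.

Then, identify the jumps (steps in age) within each id.

Then interpolate and extrapolate again within each group (i.e. each pair of id and jump), this time taking the jumps into account.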
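As a small worked instance of the first step, here is the linear equation applied by hand to id 2 of the sample data, whose first observed point is year 2009 with age 30. The names x1, y1 and years are only illustrative and are not used in the code further down.

```r
# first observed (year, age) point for id 2 in the sample data
x1 <- 2009
y1 <- 30

# years for which an age is needed
years <- 2007:2014

# y - y1 = 1 * (x - x1), i.e. age grows by one per calendar year
line_age <- y1 + (years - x1)
line_age
# 28 29 30 31 32 33 34 35
```

The line gives 33 for 2012 while the observed age is 35, which is the kind of mismatch the jump step has to pick up.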

More explanation inline:

#ensure order is correct before using shift
setorder(dat, id, year)

#' Fill NA by interpolating and extrapolating using a known point
#' 
#' @param dt - data.table
#' @param years - the xout that are required
#' 
#' @return a numeric vector of ages given the years
#' 
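extrapolate <- function(dt, years) {
    #find the first non NA entry
    firstnonNA <- head(dt[!is.na(age)], 1)

    #no observed age in this group, so there is nothing to anchor the line to
    if (nrow(firstnonNA) == 0L) return(rep(NA_real_, length(years)))

    #using linear equation y - y_1 = 1 * (x - x_1), vectorised over years
    as.numeric((years - firstnonNA$year) + firstnonNA$age)
}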

#interp and extrap age for years that are missing age assuming linearity without jumps
dat[, imp1 := extrapolate(.SD, year), by="id"]

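#identifying when the age jumps up/down: either an observed age departs from
#the line through the first observed age, or two consecutive observed ages
#do not differ by exactly one year (age should equal previous age + 1)
dat[, jump:=cumsum(
        (!is.na(age) & imp1!=age) |
        (!is.na(age) & !is.na(shift(age)) & age!=(shift(age)+1))
    ), by="id"]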

#interp and extrap age for years taking into account jumps
dat[, age_imp1 := extrapolate(.SD, year), by=c("id","jump")]

#print results
dat[,c("imp1","jump"):=NULL][]

#check if the results are identical as requested
dat[, identical(age_imp, age_imp1)]

Answer 2

I ended up writing this function:

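impute.age <- function(age) {
  if (any(is.na(age))) {
    # all-NA input: min() would warn and return Inf, so return NAs directly
    if (all(is.na(age))) return(rep(NA_real_, length(age)))
    min.age <- min(age, na.rm = TRUE)
    position <- which(age == min.age)[1] # first occurrence if there are ties
    if (position > 1) { # initial values: count backwards from the minimum
      # note: this overwrites any observed values located before the minimum
      for (i in 1:(position - 1)) {
        age[position - i] <- age[position] - i
      }
    }
    missing <- which(is.na(age)) # missing data position
    for (i in missing) {
      age[i] <- age[i - 1] + 1 # previous value plus one year
    }
  }
  return(age)
}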
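For comparison, here is a minimal vectorised sketch of the same carry-forward idea, applied per id with data.table. It is not the function above: impute_age_vec and the column age_imp2 are hypothetical names, the anchor is the first observed age rather than the minimum, and the offset is taken from year instead of row position, so it assumes rows are sorted by year within id.

```r
library(data.table)

dat <- data.table(
  id   = rep(1:2, each = 8),
  year = rep(2007:2014, times = 2),
  age  = c(1, NA, 3, NA, NA, 5, 7, NA, NA, NA, 30, NA, 32, 35, NA, NA)
)
setorder(dat, id, year)

# Hypothetical helper: continue from the most recent observed age,
# adding the number of years elapsed since that observation.
impute_age_vec <- function(age, year) {
  n <- length(age)
  obs <- which(!is.na(age))
  if (length(obs) == 0L) return(rep(NA_real_, n))
  # index of the latest observation at or before each row (0 if none yet)
  last <- cummax(ifelse(is.na(age), 0L, seq_len(n)))
  # rows before the first observation count backwards from it
  last[last == 0L] <- obs[1]
  age[last] + (year - year[last])
}

dat[, age_imp2 := impute_age_vec(age, year), by = id]
dat
```

On the sample data this should reproduce the age_imp column from the question; observed ages are never overwritten because each observed row is its own anchor.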

Impute age based on a year sequence
