A simple way to calculate age and cohort sample sizes in R

Time: 2018-01-16 18:28:45

Tags: r

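In a prospective study, you want to summarize the ages of your sample, the calendar years in which they were observed, and how long each person was observed in total. Together these cover the age, period, and cohort time scales of the sample.

The easiest way to illustrate this is with simulated data:

Suppose these data summarize the baseline age and the start and stop dates of observation for a group of clinic patients: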

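set.seed(123)
n <- 10000
# Draws happen in this order: age, then start day, then follow-up length
age <- sample(seq(40, 80, by = 5), n, replace = TRUE)  # baseline age in years
n0  <- runif(n, 10000, 12000)                          # start, in days since 1970-01-01
Obs <- data.frame(
  age   = age,
  start = as.Date(n0, origin = "1970-01-01"),
  end   = as.Date(n0 + runif(n, 0, 3652.5), origin = "1970-01-01")  # up to 10 years later
)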

I would like a function, foo, that takes the vectors

AgeCut <- c(0, 65, Inf)
Yrcut <- c(0, 2000, Inf)
DurCut <- c(0, 5, Inf)

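and cross-tabulates the number of individuals who fall, for at least one day, into each possible combination of these categories. Or, more elaborately, the number of years each person spends in each category. The category codes used here are yt65 / ot65 (younger / older than 65), bf2000 / af2000 (before / after 2000), and lt5year / gt5year (less / more than 5 years of follow-up).

For example, a person who is 40 years old when they enter the sample in 1990 and stays for 30 years spends 5 years in the yt65 / bf2000 / lt5year category, then enters yt65 / bf2000 / gt5year and stays there for 5 years. They then enter yt65 / af2000 / gt5year for another 15 years, and finally ot65 / af2000 / gt5year for the remaining 5 years.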
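The worked example above can be checked with a short sketch. This is only an illustration for one hypothetical participant, not the requested foo: the variable names (entry_age, entry_year, follow_up, crossings, segments) are made up here, and all times are assumed to be in whole years.

```r
# Hypothetical single participant from the example above
entry_age  <- 40     # age at entry (years)
entry_year <- 1990   # calendar year at entry
follow_up  <- 30     # total years observed

# Years since entry at which the person crosses each cut point
crossings <- c(
  age      = 65 - entry_age,      # turns 65 after 25 years
  period   = 2000 - entry_year,   # reaches the year 2000 after 10 years
  duration = 5                    # reaches 5 years of follow-up after 5 years
)

# Segment boundaries: entry, every crossing inside follow-up, and exit
inside <- crossings[crossings > 0 & crossings < follow_up]
breaks <- sort(unique(c(0, inside, follow_up)))
from   <- head(breaks, -1)
to     <- tail(breaks, -1)

# Classify each segment by its state at the start of the segment
segments <- data.frame(
  age_cat    = ifelse(entry_age + from < 65, "yt65", "ot65"),
  period_cat = ifelse(entry_year + from < 2000, "bf2000", "af2000"),
  dur_cat    = ifelse(from < 5, "lt5year", "gt5year"),
  years      = to - from
)
print(segments)
```

The years column sums to the full follow-up time, which is a useful sanity check for any implementation of the person-years version of the problem.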

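For some reason this is really bending my brain, and I cannot work out how to compute the actual desired output, even with some inefficient for loops. The format and structure would be similar to: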

        AgeCut             YrCut            DurCut  NumObs
1 younger than 65    before 2000 less than 5 years    1000
2    65 and older    before 2000 less than 5 years    1000   
3 younger than 65 2000 and later less than 5 years    1000
4    65 and older 2000 and later less than 5 years    1000
5 younger than 65    before 2000   5 or more years    1000
6    65 and older    before 2000   5 or more years    1000
7 younger than 65 2000 and later   5 or more years    1000
8    65 and older 2000 and later   5 or more years    1000

2 answers:

Answer 0 (score: 1)

Using some tidyverse functions, I think you want something like this

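library(tidyverse)
library(lubridate)  # provides year(); older tidyverse versions do not attach it
AgeCut <- c(0, 65, Inf)
Yrcut <- c(0, 2000, Inf)
DurCut <- c(0, 5, Inf)

# Classify each person once: by baseline age, by start year, and by the
# calendar-year difference between end and start (an approximation of duration)
Obs %>% transmute(
  ageCat = cut(age, AgeCut, c("younger than 65", "65 and older"), right=FALSE),
  startCat = cut(year(start), Yrcut, c("before 2000", "2000 and later"), right=FALSE),
  DurCut = cut(year(end)-year(start), DurCut, c("less than 5 years", "5 or more years"), right=FALSE)
) %>% table() %>% as_tibble()  # as_tibble() supersedes the deprecated as_data_frame()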

which returns

            ageCat       startCat            DurCut     n
             <chr>          <chr>             <chr> <int>
1 younger than 65     before 2000 less than 5 years  1196
2     65 and older    before 2000 less than 5 years   968
3 younger than 65  2000 and later less than 5 years  1312
4     65 and older 2000 and later less than 5 years  1015
5 younger than 65     before 2000   5 or more years  1503
6     65 and older    before 2000   5 or more years  1185
7 younger than 65  2000 and later   5 or more years  1580
8     65 and older 2000 and later   5 or more years  1241

The cut() function is doing most of the work.
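As a small illustration of why right=FALSE matters in the calls above, the following sketch compares left-closed and right-closed intervals on a few made-up ages. The vector ages and the result names are hypothetical.

```r
ages   <- c(40, 64, 65, 80)   # made-up ages, including the boundary value 65
breaks <- c(0, 65, Inf)
labs   <- c("younger than 65", "65 and older")

# right = FALSE gives intervals [0, 65) and [65, Inf), so 65 counts as "65 and older"
left_closed <- cut(ages, breaks, labels = labs, right = FALSE)

# The default right = TRUE gives (0, 65] and (65, Inf], so 65 lands in the first interval
right_closed <- cut(ages, breaks, labels = labs)

print(data.frame(ages, left_closed, right_closed))
```

With the default setting, a person aged exactly 65 would be labelled "younger than 65", which is why the answer sets right=FALSE.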

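Answer 1 (score: 0)

OK, I have an implementation of this in base R. It recursively computes the time each person spends in their current category until they move to the next one, adds that duration to the respective counter and subtracts it from the total duration of participation, and then passes the updated times and remaining duration back to the apc function.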

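# times: list or data frame with one element per time scale
# cuts:  list of break vectors, in the same order as times
# dur:   remaining follow-up per person, in the same units as times
apc <- function(times, cuts, dur, strata=1) {
  class <- mapply(findInterval, times, cuts)  ## current category on each scale
  tnext <- mapply( ## times until next category
    function(t, c, i) {c[i+1] - t}, 
    times, cuts, as.data.frame(class)
  )
  mnext <- apply(tnext, 1, min, na.rm=TRUE) ## minimum time to next category
  mnext <- pmin(mnext, dur) ## truncate if duration exceeded before next
  dur <- dur-mnext ## follow-up still to be allocated
  times <- lapply(times, `+`, mnext) ## advance every scale by the time just spent
  ## one data frame per recursion step: category indices plus time spent there
  if (all(dur == 0))
    return(list(data.frame(class, 't'=mnext, strata)))
  return(c(list(data.frame(class, 't'=mnext, strata)), apc(times, cuts, dur, strata=strata)))
}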

This gives the following estimates of person-years in each category:

> val
  age start cohort strata         t
1   1     1      1      1  3175.986
2   2     1      1      1  2582.793
3   1     2      1      1 17714.503
4   2     2      1      1 13972.134
5   1     2      2      1  5658.430
6   2     2      2      1  6957.702

The total (50,061.55) equals the sum of Obs$end-Obs$start, apparently expressed in years.
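The answer does not show the call that produced val. One plausible way to get a table of that shape, sketched here under stated assumptions, is to bind the per-step data frames and sum t within each category combination. The two participants below are made up, all times are assumed to be in years, and the use of aggregate() is a guess rather than the answerer's confirmed method. apc is repeated so the block runs on its own. It needs at least two participants, because mapply() returns a vector instead of a matrix for a single row.

```r
# apc as given in the answer (repeated so this sketch is self-contained)
apc <- function(times, cuts, dur, strata = 1) {
  class <- mapply(findInterval, times, cuts)
  tnext <- mapply(function(t, c, i) c[i + 1] - t, times, cuts, as.data.frame(class))
  mnext <- pmin(apply(tnext, 1, min, na.rm = TRUE), dur)
  dur   <- dur - mnext
  times <- lapply(times, `+`, mnext)
  if (all(dur == 0))
    return(list(data.frame(class, t = mnext, strata)))
  c(list(data.frame(class, t = mnext, strata)), apc(times, cuts, dur, strata = strata))
}

# Two hypothetical participants; cohort is time since entry, so it starts at 0
times <- data.frame(age = c(40, 60), start = c(1990, 1998), cohort = c(0, 0))
cuts  <- list(c(0, 65, Inf), c(0, 2000, Inf), c(0, 5, Inf))
dur   <- c(30, 10)   # years of follow-up

pieces <- apc(times, cuts, dur)

# Assumed aggregation step: total person-years per category combination
val <- aggregate(t ~ age + start + cohort + strata,
                 data = do.call(rbind, pieces), FUN = sum)
print(val)
```

As in the answer, the sum of val$t should equal the total follow-up time, here sum(dur).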