Question

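I want to create a survival dataset featuring multiple-record IDs. The existing event data consists of one observation per row, with dates formatted as dd/mm/yy. The idea is to count the number of consecutive months with at least one event per month (there are multiple years, so that has to be accounted for somehow). In other words, I want to create episodes that capture such monthly streaks, including periods of inactivity. As an example, the code should transform something like this: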

df1
id        event.date
group1    01/01/16
group1    05/02/16
group1    07/03/16
group1    10/06/16
group1    12/09/16

into this:

df2
id        t0    t1    ep.no   ep.t   ep.type
group1    1     3     1       3      1  
group1    4     5     2       2      0
group1    6     6     3       1      1
group1    7     8     4       2      0
group1    9     9     5       1      1
group1    10    ...   ...     ...    ...

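where t0 and t1 are the start and end months, ep.no is the episode counter for a particular id, ep.t is the length of that particular episode, and ep.type indicates the type of episode (active/inactive). In the example above, there is an initial three-month period of activity, then a two-month break, followed by a single-month episode of relapse, etc.

I am mostly concerned with the transformation of t0 and t1 from df1 to df2, since the other variables in df2 can be constructed afterwards from them (e.g. ep.no is a counter, ep.t is arithmetic, and ep.type always starts at 1 and alternates). Given the complexity of the problem (at least for me), I realize I may need to provide the actual data, but I'm not sure whether that is allowed? If a mod steps in, I'll see what I can do.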

Answer 1

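I think this does what you want. The trick is identifying the sequences of observations that need to be treated together, and using dplyr::lag with cumsum is the way to go.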

# Convert to date objects, summarize by month, insert missing months
library(tidyverse)
library(lubridate)

# added rows of data to demonstrate that it works with 
# > 1 id and > 1 event per month and rolls across year end
df1 <- read_table("id        event.date
group1    01/01/16
group1    02/01/16
group1    05/02/16
group1    07/03/16
group1    10/06/16
group1    12/09/16
group1    01/02/17
group2    01/01/16
group2    05/02/16
group2    07/03/16",col_types="cc")

# dmy() tolerates the extra whitespace and converts the strings to dates;
# extract year and month so events can be counted per month
df1.1 <- mutate(df1, event.date=dmy(event.date),
              yr=year(event.date),
              mon=month(event.date))

# get down to one row per id and month, then fill in the months without events
df2 <- group_by(df1.1,id,yr,mon) %>%
  summarize(events=n()) %>%
  complete(id, yr, mon=1:12, fill=list(events=0)) %>%
  group_by(id) %>%
  mutate(event = as.numeric(events >0),
    is_start=lag(event,default=-1)!=event,
    episode=cumsum(is_start), 
    episode.date=ymd(paste(yr,mon,1,sep="-"))) %>%
  group_by(id, episode) %>%
  summarize(t0 = first(episode.date),
            t1 = last(episode.date) %m+% months(1),
            ep.length = as.numeric((last(episode.date) %m+% months(1)) - first(episode.date)),
            ep.type = first(event))

which gives:

Source: local data frame [10 x 6]
Groups: id [?]

       id episode         t0         t1 ep.length ep.type
    <chr>   <int>     <dttm>     <dttm>     <dbl>   <dbl>
1  group1       1 2016-01-01 2016-04-01        91       1
2  group1       2 2016-04-01 2016-06-01        61       0
3  group1       3 2016-06-01 2016-07-01        30       1
4  group1       4 2016-07-01 2016-09-01        62       0
5  group1       5 2016-09-01 2016-10-01        30       1
6  group1       6 2016-10-01 2017-02-01       123       0
7  group1       7 2017-02-01 2017-03-01        28       1
8  group1       8 2017-03-01 2018-01-01       306       0
9  group2       1 2016-01-01 2016-04-01        91       1
10 group2       2 2016-04-01 2017-01-01       275       0
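To see the lag/cumsum step in isolation, here is a minimal sketch on a hand-written indicator vector. The vector `event` is made up for illustration and mirrors the first nine months of group1 in the question (1 = at least one event in that month, 0 = none). A new episode starts whenever the indicator differs from the previous month, and the running sum of those starts numbers the episodes.

```r
library(dplyr)

# Illustrative monthly activity indicator (hypothetical data)
event <- c(1, 1, 1, 0, 0, 1, 0, 0, 1)

# TRUE wherever the value changes from the previous month;
# default = -1 guarantees the first month always starts an episode
is_start <- lag(event, default = -1) != event

# Running count of episode starts gives an episode id per month
episode <- cumsum(is_start)
episode
```

Months sharing the same `episode` value belong to one streak, which is why the pipeline above can simply group by `id` and `episode` and take the first and last month of each group.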

Using complete() with mon=1:12 will always make the last episode stretch to the end of that year. The solution would be to insert a filter() on yr and mon after complete().
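One way such a filter might look is sketched below. This is not part of the original answer: the small `monthly` table, the helper column `idx`, and the choice to cut each id off at its last month with an event are all assumptions made for illustration. It also assumes a recent tidyr, where complete() on a grouped data frame fills in values within each group.

```r
library(dplyr)
library(tidyr)

# Hypothetical per-month event counts for a single id
monthly <- tibble(
  id     = c("group1", "group1", "group1"),
  yr     = c(2016, 2016, 2016),
  mon    = c(1L, 2L, 6L),
  events = c(2, 1, 1)
)

trimmed <- monthly %>%
  group_by(id, yr) %>%
  complete(mon = 1:12, fill = list(events = 0)) %>%  # fill missing months per id and year
  group_by(id) %>%
  mutate(idx = yr * 12 + mon) %>%                    # running month number across years
  filter(idx <= max(idx[events > 0])) %>%            # drop months after the last event
  select(-idx) %>%
  ungroup()

trimmed
```

With this cut-off the trailing inactive episode disappears entirely; whether that is desirable depends on where the observation window is supposed to end.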

The advantage of keeping t0 and t1 as date-time objects is that they work correctly across year boundaries, which using month numbers won't.
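If month numbers like those in the question's df2 are still wanted, the dates can be converted to a running month count afterwards. The sketch below is an assumption-laden illustration: `month_index` and `origin` are hypothetical names, and the origin is taken to be the first month of the study.

```r
library(lubridate)

# Running month number of date d, counting the origin month as 1
month_index <- function(d, origin) {
  12 * (year(d) - year(origin)) + month(d) - month(origin) + 1
}

origin <- ymd("2016-01-01")

month_index(ymd("2016-09-01"), origin)      # September 2016
month_index(ymd("2017-02-01"), origin)      # February 2017, continues past 12
month_index(ymd("2016-04-01"), origin) - 1  # inclusive end month for t1 = 2016-04-01
```

Because t1 in the output above is the first day of the month after the episode ends, subtracting 1 from its index gives the inclusive end month used in the question.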

Session info:

R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets 
[6] methods   base     

other attached packages:
[1] lubridate_1.3.3 dplyr_0.5.0     purrr_0.2.2    
[4] readr_0.2.2     tidyr_0.6.0     tibble_1.2     
[7] ggplot2_2.2.0   tidyverse_1.0.0

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.8      knitr_1.15.1     magrittr_1.5    
 [4] munsell_0.4.2    colorspace_1.2-6 R6_2.1.3        
 [7] stringr_1.1.0    highr_0.6        plyr_1.8.4      
[10] tools_3.3.2      grid_3.3.2       gtable_0.2.0    
[13] DBI_0.5          lazyeval_0.2.0   assertthat_0.1  
[16] digest_0.6.10    memoise_1.0.0    evaluate_0.10   
[19] stringi_1.1.2    scales_0.4.1

R: creating a time-varying survival dataset from event data
