我有一个庞大的数据集,但我会根据一个例子来解释这个问题。
在下表中,我有Unique ID
,我有Date
(每个Unique ID
的时间序列)和Unique ID
的状态。
Unique ID Date Status
1 Jan-06 Active
1 Feb-06 Active
1 Mar-06 Not active
1 Apr-06 Stable
1 May-06 Active
1 Jun-06 Stable
1 Jul-06 Active
1 Aug-06 Active
1 Sep-06 Active
2 Oct-06 Active
2 Nov-06 Not active
2 Dec-06 Stable
2 Jan-07 Active
2 Feb-07 Stable
2 Mar-07 Active
2 Apr-07 Active
2 May-07 Active
我想要的结果是捕获下一个发生事件的日期(不活跃或稳定)
如果您看到下面的独特ID 1在2006年1月处于活动状态,我们需要捕获它何时达不到活动或稳定,它会在06年3月达到一段时间。
如果您看到下面的独特ID 1在2006年5月处于活动状态,我们需要捕获它何时达不到活动或稳定,它会在06年6月达到一段时间。
注意:无需为已经处于非活动状态或稳定状态的ID添加任何日期
Unique ID Date Status Result
1 Jan-06 Active Mar-06
1 Feb-06 Active Apr-06
1 Mar-06 Not active NA
1 Apr-06 Stable NA
1 May-06 Active Jun-06
1 Jun-06 Stable NA
1 Jul-06 Active Always active
1 Aug-06 Active Always active
1 Sep-06 Active Always active
2 Oct-06 Active Nov-06
2 Nov-06 Not active NA
2 Dec-06 Stable NA
2 Jan-07 Active Feb-07
2 Feb-07 Stable NA
2 Mar-07 Active Always active
2 Apr-07 Active Always active
2 May-07 Active Always active
答案 0 :(得分:0)
生成示例数据:
> set.seed(1234)
> id <- rep(1:3, times=c(10,10,10))
> date <- seq(as.Date("2000/1/1"), by = "month", length.out = length(id))
> date <- format(date, "%b-%y")
> status <- sample(c("Active", "Not Active", "Stable"), length(id), replace=TRUE)
> data.frame(id, date, status)
id date status
1 1 Jan-00 Active
2 1 Feb-00 Not Active
3 1 Mar-00 Not Active
4 1 Apr-00 Not Active
5 1 May-00 Stable
6 1 Jun-00 Not Active
7 1 Jul-00 Active
8 1 Aug-00 Active
9 1 Sep-00 Not Active
10 1 Oct-00 Not Active
11 2 Nov-00 Stable
12 2 Dec-00 Not Active
13 2 Jan-01 Active
14 2 Feb-01 Stable
15 2 Mar-01 Active
16 2 Apr-01 Stable
17 2 May-01 Active
18 2 Jun-01 Active
19 2 Jul-01 Active
20 2 Aug-01 Active
21 3 Sep-01 Active
22 3 Oct-01 Active
23 3 Nov-01 Active
24 3 Dec-01 Active
25 3 Jan-02 Active
26 3 Feb-02 Stable
27 3 Mar-02 Not Active
28 3 Apr-02 Stable
29 3 May-02 Stable
30 3 Jun-02 Active
将id
和date
个变量的副本中的每个status
替换为“始终有效”的最后一个“有效”观察结果:
> id_last_active <- (id != c(id[-1], FALSE)) & (status == "Active")
> date2 <- as.character(date)
> date2[id_last_active] <- "Always Active"
> status2 <- status
> status2[id_last_active] <- "Always Active"
使用which()
获取非“有效”观察的位置(无论id
),并使用diff()
和rep()
创建一个位置向量先前“活动”事件组的下一个非“活动”事件。然后获取“活动”事件的下一个非活动事件的日期:
> nonactive <- which(status2 != "Active")
> status_len <- c(nonactive[1], diff(nonactive))
> next_event <- date2[rep(nonactive, times = status_len)]
> next_event[status2 %in% c("Stable", "Not Active")] <- NA
按id
打印数据(使用by()
):
> dat <- data.frame(id, date, status, next_event)
> by(dat, dat$id, function(x) x)
dat$id: 1
id date status next_event
1 1 Jan-00 Active Feb-00
2 1 Feb-00 Not Active <NA>
3 1 Mar-00 Not Active <NA>
4 1 Apr-00 Not Active <NA>
5 1 May-00 Stable <NA>
6 1 Jun-00 Not Active <NA>
7 1 Jul-00 Active Sep-00
8 1 Aug-00 Active Sep-00
9 1 Sep-00 Not Active <NA>
10 1 Oct-00 Not Active <NA>
---------------------------------------------------------------
dat$id: 2
id date status next_event
11 2 Nov-00 Stable <NA>
12 2 Dec-00 Not Active <NA>
13 2 Jan-01 Active Feb-01
14 2 Feb-01 Stable <NA>
15 2 Mar-01 Active Apr-01
16 2 Apr-01 Stable <NA>
17 2 May-01 Active Always Active
18 2 Jun-01 Active Always Active
19 2 Jul-01 Active Always Active
20 2 Aug-01 Active Always Active
---------------------------------------------------------------
dat$id: 3
id date status next_event
21 3 Sep-01 Active Feb-02
22 3 Oct-01 Active Feb-02
23 3 Nov-01 Active Feb-02
24 3 Dec-01 Active Feb-02
25 3 Jan-02 Active Feb-02
26 3 Feb-02 Stable <NA>
27 3 Mar-02 Not Active <NA>
28 3 Apr-02 Stable <NA>
29 3 May-02 Stable <NA>
30 3 Jun-02 Active Always Active