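I have a fictional hospital dataset and, whenever a (non-existent) hospital transfer occurs, I need to replace the discharge date with the final discharge date of that stay. I select the rows involved in transfers like this: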
rows <- sort(c(which(data$TRANSFER_NUM != 0), which(data$TRANSFER_NUM == 1)-1))
subset <- data[rows,]
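To make the row selection above easier to follow, here is a minimal sketch on made-up data. The toy `data` frame and the helper names `transfer_rows`, `origin_rows` and `selected` are assumptions for illustration only; the logic is the same `which()` expression split into two steps.

```r
# Hypothetical toy data: TRANSFER_NUM is 0 for an ordinary stay and
# counts up 1, 2, ... for each transfer that follows it.
data <- data.frame(
  ID = c("A", "A", "A", "A", "B", "B", "B"),
  TRANSFER_NUM = c(0, 0, 1, 2, 0, 0, 1)
)

# Every row that is itself a transfer ...
transfer_rows <- which(data$TRANSFER_NUM != 0)
# ... plus the row just before each first transfer,
# i.e. the stay the patient was transferred from.
origin_rows <- which(data$TRANSFER_NUM == 1) - 1

rows <- sort(c(transfer_rows, origin_rows))
print(rows)  # 2 3 4 6 7

selected <- data[rows, ]
print(selected)
```

Rows 1 and 5 are dropped because they are stays that were never followed by a transfer. Note that this relies on the rows already being sorted so that each transfer directly follows the stay it belongs to.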
Annoyingly, some people can have multiple transfers for different episodes, i.e.
ID DISCHARGE_DATE FILE_SEQUENCE TRANSFER_NUM
A  1992-12-04     3360          0
A  1993-02-11     3361          1
A  1993-03-10     3362          2
A  1993-11-25     3363          3
B  1987-05-15     3419          0
B  1987-05-19     3420          1
B  1990-02-03     3473          0
B  1990-02-05     3474          1
This means that
ddply(subset, "ID", mutate, max=max(DISCHARGE_DATE))
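gives the wrong result for person B, because it takes the maximum over all of B's rows rather than per episode. The correct result should be: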
ID DISCHARGE_DATE FILE_SEQUENCE TRANSFER_NUM NEW_DISCHARGE_DATE
A  1992-12-04     3360          0            1993-11-25
A  1993-02-11     3361          1            1993-11-25
A  1993-03-10     3362          2            1993-11-25
A  1993-11-25     3363          3            1993-11-25
B  1987-05-15     3419          0            1987-05-19
B  1987-05-19     3420          1            1987-05-19
B  1990-02-03     3473          0            1990-02-05
B  1990-02-05     3474          1            1990-02-05
I think some additional grouping might help, perhaps something like this:
ID DISCHARGE_DATE FILE_SEQUENCE TRANSFER_NUM GROUP NEW_DISCHARGE_DATE
A  1992-12-04     3360          0            1     1993-11-25
A  1993-02-11     3361          1            1     1993-11-25
A  1993-03-10     3362          2            1     1993-11-25
A  1993-11-25     3363          3            1     1993-11-25
B  1987-05-15     3419          0            1     1987-05-19
B  1987-05-19     3420          1            1     1987-05-19
B  1990-02-03     3473          0            2     1990-02-05
B  1990-02-05     3474          1            2     1990-02-05
Any help would be highly appreciated!
Answer 0 (score: 2)
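You are right, you need an intermediate grouping column. Here is a nested ddply: the inner call starts a new GROUP within each ID whenever TRANSFER_NUM drops, and the outer call takes the latest discharge date per ID and GROUP: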
ddply(
ddply(df, "ID", mutate, GROUP=cumsum(c(0, diff(TRANSFER_NUM) < 0))),
c("ID", "GROUP"),
mutate, DISCHARGE_NEW=max(as.character(DISCHARGE_DATE))
)
# ID DISCHARGE_DATE FILE_SEQUENCE TRANSFER_NUM GROUP DISCHARGE_NEW
# 1 A 1992-12-04 3360 0 0 1993-11-25
# 2 A 1993-02-11 3361 1 0 1993-11-25
# 3 A 1993-03-10 3362 2 0 1993-11-25
# 4 A 1993-11-25 3363 3 0 1993-11-25
# 5 B 1987-05-15 3419 0 0 1987-05-19
# 6 B 1987-05-19 3420 1 0 1987-05-19
# 7 B 1990-02-03 3473 0 1 1990-02-05
# 8 B 1990-02-05 3474 1 1 1990-02-05
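For readers without plyr, the same two steps can be sketched in base R with `ave()`. This is only an illustration under assumptions, not the answer's own method: it assumes the rows are already ordered within each ID, and the `episode` key and the numeric round-trip for the dates are choices made here.

```r
df <- read.table(text = "ID DISCHARGE_DATE FILE_SEQUENCE TRANSFER_NUM
A 1992-12-04 3360 0
A 1993-02-11 3361 1
A 1993-03-10 3362 2
A 1993-11-25 3363 3
B 1987-05-15 3419 0
B 1987-05-19 3420 1
B 1990-02-03 3473 0
B 1990-02-05 3474 1", header = TRUE, stringsAsFactors = FALSE)
df$DISCHARGE_DATE <- as.Date(df$DISCHARGE_DATE)

# Step 1: within each ID, start a new group whenever TRANSFER_NUM drops.
df$GROUP <- ave(df$TRANSFER_NUM, df$ID,
                FUN = function(x) cumsum(c(0, diff(x) < 0)))

# Step 2: latest discharge date per (ID, GROUP). A single pasted key
# (hypothetical name `episode`) avoids empty ID/GROUP combinations.
episode <- paste(df$ID, df$GROUP)
df$DISCHARGE_NEW <- as.Date(
  ave(as.numeric(df$DISCHARGE_DATE), episode, FUN = max),
  origin = "1970-01-01"
)

print(df)
```

Working on real `Date` values here means `max()` compares dates directly, so the `as.character()` conversion used in the ddply version is not needed.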
Answer 1 (score: 1)
Try:
ddply(subset, .(ID,grp=c(0,cumsum(diff(subset$TRANSFER_NUM)-1))), mutate, max=max(DISCHARGE_DATE))
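This assumes that TRANSFER_NUM is consecutive within each episode, i.e. 1:x, so that grp stays constant inside an episode and only changes when TRANSFER_NUM resets.
Following up on the comments, this is the result I get: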
subset<-read.table(text="ID DISCHARGE_DATE FILE_SEQUENCE TRANSFER_NUM
A 1992-12-04 3360 0
A 1993-02-11 3361 1
A 1993-03-10 3362 2
A 1993-11-25 3363 3
B 1987-05-15 3419 0
B 1987-05-19 3420 1
B 1990-02-03 3473 0
B 1990-02-05 3474 1",header=T)
subset$DISCHARGE_DATE<-as.Date(subset$DISCHARGE_DATE)
ddply(subset, .(ID,grp=c(0,cumsum(diff(subset$TRANSFER_NUM)-1))), mutate, max=max(DISCHARGE_DATE))
grp ID DISCHARGE_DATE FILE_SEQUENCE TRANSFER_NUM max
1 0 A 1992-12-04 3360 0 1993-11-25
2 0 A 1993-02-11 3361 1 1993-11-25
3 0 A 1993-03-10 3362 2 1993-11-25
4 0 A 1993-11-25 3363 3 1993-11-25
5 -6 B 1990-02-03 3473 0 1990-02-05
6 -6 B 1990-02-05 3474 1 1990-02-05
7 -4 B 1987-05-15 3419 0 1987-05-19
8 -4 B 1987-05-19 3420 1 1987-05-19
If the ordering of grp within each ID is the problem, just flip the sign in front of the grp definition:
ddply(subset, .(ID,grp=-c(0,cumsum(diff(subset$TRANSFER_NUM)-1))), mutate, max=max(DISCHARGE_DATE))
grp ID DISCHARGE_DATE FILE_SEQUENCE TRANSFER_NUM max
1 0 A 1992-12-04 3360 0 1993-11-25
2 0 A 1993-02-11 3361 1 1993-11-25
3 0 A 1993-03-10 3362 2 1993-11-25
4 0 A 1993-11-25 3363 3 1993-11-25
5 4 B 1987-05-15 3419 0 1987-05-19
6 4 B 1987-05-19 3420 1 1987-05-19
7 6 B 1990-02-03 3473 0 1990-02-05
8 6 B 1990-02-05 3474 1 1990-02-05
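The consecutiveness assumption in this answer matters because the grp key is built from `diff(TRANSFER_NUM) - 1`, which is zero only for steps of exactly +1. The sketch below uses a made-up vector with a gap to show the effect. The alternative key `cumsum(transfer_num == 0)` is an assumption added here, not part of the answer; it presumes every episode starts with a row where TRANSFER_NUM is 0.

```r
# Hypothetical data: the first episode jumps from 1 to 3.
transfer_num <- c(0, 1, 3, 0, 1)

# Key from the answer: constant only while each step is exactly +1.
grp_diff <- c(0, cumsum(diff(transfer_num) - 1))

# Alternative key (an assumption): a new episode starts at every 0.
grp_zero <- cumsum(transfer_num == 0)

print(grp_diff)  # 0 0 1 -3 -3 : the first episode is split in two
print(grp_zero)  # 1 1 1 2 2   : the first episode stays together
```

With strictly consecutive transfer numbers, both keys partition the rows identically, so either can be passed to ddply alongside ID.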