Grouping a table by ID and gap-free sequence in R

Asked: 2014-01-16 16:42:16

Tags: r plyr

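I have a table of fictitious hospital data in which I need to replace the discharge date with the final discharge date whenever a patient is transferred between (non-existent) hospitals.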

rows <- sort(c(which(data$TRANSFER_NUM != 0), which(data$TRANSFER_NUM == 1)-1))
subset <- data[rows,]

Annoyingly, some people have multiple runs of transfers for different episodes, i.e.

 
ID     DISCHARGE_DATE   FILE_SEQUENCE   TRANSFER_NUM
A      1992-12-04       3360            0
A      1993-02-11       3361            1
A      1993-03-10       3362            2
A      1993-11-25       3363            3
B      1987-05-15       3419            0
B      1987-05-19       3420            1
B      1990-02-03       3473            0
B      1990-02-05       3474            1

This means that

ddply(subset, "ID", mutate, max=max(DISCHARGE_DATE))

would give the wrong result for person B, whereas the correct result should be:

 
ID     DISCHARGE_DATE   FILE_SEQUENCE   TRANSFER_NUM    NEW_DISCHARGE_DATE
A      1992-12-04       3360            0               1993-11-25 
A      1993-02-11       3361            1               1993-11-25 
A      1993-03-10       3362            2               1993-11-25 
A      1993-11-25       3363            3               1993-11-25 
B      1987-05-15       3419            0               1987-05-19    
B      1987-05-19       3420            1               1987-05-19    
B      1990-02-03       3473            0               1990-02-05
B      1990-02-05       3474            1               1990-02-05
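
The problem with grouping on ID alone can be reproduced without plyr. The following is a minimal base R sketch, not code from the original post: it uses `ave` as a stand-in for the per-ID `ddply` call, and the column name `max_by_id` is made up for illustration. The data is copied from the table above.

```r
# Example data copied from the question's table.
df <- read.table(text = "ID DISCHARGE_DATE FILE_SEQUENCE TRANSFER_NUM
A 1992-12-04 3360 0
A 1993-02-11 3361 1
A 1993-03-10 3362 2
A 1993-11-25 3363 3
B 1987-05-15 3419 0
B 1987-05-19 3420 1
B 1990-02-03 3473 0
B 1990-02-05 3474 1", header = TRUE, stringsAsFactors = FALSE)

# Per-ID maximum (hypothetical column name). ISO-formatted date strings
# sort chronologically, so max() on the character column is sufficient here.
df$max_by_id <- ave(df$DISCHARGE_DATE, df$ID, FUN = max)

# Person B's first episode (FILE_SEQUENCE 3419 and 3420) picks up the
# discharge date of the second episode instead of 1987-05-19.
df[df$FILE_SEQUENCE %in% c(3419, 3420), c("ID", "DISCHARGE_DATE", "max_by_id")]
```

Both rows of B's first episode end up with 1990-02-05, which is why an extra grouping level is needed.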

I think some extra grouping might help, perhaps something like this:

 
ID     DISCHARGE_DATE   FILE_SEQUENCE   TRANSFER_NUM    GROUP    NEW_DISCHARGE_DATE
A      1992-12-04       3360            0               1        1993-11-25 
A      1993-02-11       3361            1               1        1993-11-25 
A      1993-03-10       3362            2               1        1993-11-25 
A      1993-11-25       3363            3               1        1993-11-25 
B      1987-05-15       3419            0               1        1987-05-19    
B      1987-05-19       3420            1               1        1987-05-19    
B      1990-02-03       3473            0               2        1990-02-05
B      1990-02-05       3474            1               2        1990-02-05

Any help would be greatly appreciated!

2 Answers:

Answer 0: (score: 2)

You are right, you need an intermediate grouping column. Here is a nested ddply:

ddply(
  ddply(df, "ID", mutate, GROUP=cumsum(c(0, diff(TRANSFER_NUM) < 0))),
  c("ID", "GROUP"),
  mutate, DISCHARGE_NEW=max(as.character(DISCHARGE_DATE))
)
#   ID DISCHARGE_DATE FILE_SEQUENCE TRANSFER_NUM GROUP DISCHARGE_NEW
# 1  A     1992-12-04          3360            0     0    1993-11-25
# 2  A     1993-02-11          3361            1     0    1993-11-25
# 3  A     1993-03-10          3362            2     0    1993-11-25
# 4  A     1993-11-25          3363            3     0    1993-11-25
# 5  B     1987-05-15          3419            0     0    1987-05-19
# 6  B     1987-05-19          3420            1     0    1987-05-19
# 7  B     1990-02-03          3473            0     1    1990-02-05
# 8  B     1990-02-05          3474            1     1    1990-02-05
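
To see how the GROUP expression in the inner call works, it may help to evaluate it one step at a time. This is only an illustrative breakdown for person B's TRANSFER_NUM values; the intermediate variable names are made up here and do not appear in the answer.

```r
# TRANSFER_NUM values for person B, copied from the question's table.
transfer_num <- c(0, 1, 0, 1)

steps  <- diff(transfer_num)   # 1 -1 1: change from each row to the next
resets <- steps < 0            # FALSE TRUE FALSE: the counter dropped, so a new episode starts
group  <- cumsum(c(0, resets)) # 0 0 1 1: running count of resets, padded for the first row
group
```

Because the inner ddply splits on ID first, the counter restarts at 0 for every person, which is why A and the first episode of B both show GROUP 0.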

Answer 1: (score: 1)

Try:

ddply(subset, .(ID,grp=c(0,cumsum(diff(subset$TRANSFER_NUM)-1))), mutate, max=max(DISCHARGE_DATE))

This assumes that TRANSFER_NUM is consecutive, i.e. 1:x
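
A small sketch of why that assumption matters: the grp expression only stays constant while TRANSFER_NUM rises by exactly 1 per row. The helper name `grp_from_transfers` and the counter that skips a value are hypothetical and are not taken from the question's data.

```r
# Hypothetical helper wrapping the grp expression used in the call above.
grp_from_transfers <- function(transfer_num) {
  c(0, cumsum(diff(transfer_num) - 1))
}

# Consecutive counter (person A in the question): every row gets the same label.
grp_from_transfers(c(0, 1, 2, 3))   # 0 0 0 0

# Hypothetical counter that skips 2: the last row is split into its own group.
grp_from_transfers(c(0, 1, 3))      # 0 0 1
```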

Following up on the comments, this is the result I get:

subset<-read.table(text="ID     DISCHARGE_DATE   FILE_SEQUENCE   TRANSFER_NUM
A      1992-12-04       3360            0
A      1993-02-11       3361            1
A      1993-03-10       3362            2
A      1993-11-25       3363            3
B      1987-05-15       3419            0
B      1987-05-19       3420            1
B      1990-02-03       3473            0
B      1990-02-05       3474            1",header=T)

subset$DISCHARGE_DATE<-as.Date(subset$DISCHARGE_DATE)

ddply(subset, .(ID,grp=c(0,cumsum(diff(subset$TRANSFER_NUM)-1))), mutate, max=max(DISCHARGE_DATE))

  grp ID DISCHARGE_DATE FILE_SEQUENCE TRANSFER_NUM        max
1   0  A     1992-12-04          3360            0 1993-11-25
2   0  A     1993-02-11          3361            1 1993-11-25
3   0  A     1993-03-10          3362            2 1993-11-25
4   0  A     1993-11-25          3363            3 1993-11-25
5  -6  B     1990-02-03          3473            0 1990-02-05
6  -6  B     1990-02-05          3474            1 1990-02-05
7  -4  B     1987-05-15          3419            0 1987-05-19
8  -4  B     1987-05-19          3420            1 1987-05-19

If the ordering of grp within each ID is the problem, just flip the sign in front of the grp definition:

ddply(subset, .(ID,grp=-c(0,cumsum(diff(subset$TRANSFER_NUM)-1))), mutate, max=max(DISCHARGE_DATE))

  grp ID DISCHARGE_DATE FILE_SEQUENCE TRANSFER_NUM        max
1   0  A     1992-12-04          3360            0 1993-11-25
2   0  A     1993-02-11          3361            1 1993-11-25
3   0  A     1993-03-10          3362            2 1993-11-25
4   0  A     1993-11-25          3363            3 1993-11-25
5   4  B     1987-05-15          3419            0 1987-05-19
6   4  B     1987-05-19          3420            1 1987-05-19
7   6  B     1990-02-03          3473            0 1990-02-05
8   6  B     1990-02-05          3474            1 1990-02-05
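
The grp labels above (0, 4, 6) are arbitrary. If a GROUP column numbered 1, 2, ... within each ID is wanted, as in the question's sketch, one possible approach is to renumber the labels per ID in base R. This is an assumption about what is wanted, not part of the answer; the variable names are illustrative.

```r
# ID and sign-flipped grp values copied from the output above.
id  <- c("A", "A", "A", "A", "B", "B", "B", "B")
grp <- c(0, 0, 0, 0, 4, 4, 6, 6)

# Within each ID, number the distinct grp values in order of first appearance.
group <- ave(grp, id, FUN = function(g) match(g, unique(g)))
group   # 1 1 1 1 1 1 2 2
```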