Difficulty reshaping a data frame to wide format in R

Time: 2015-03-05 19:25:15

Tags: r dplyr tidyr

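I have a data frame (shown below) containing cases (ID) that received different diagnoses (DX), both within a single admission and across different admissions. I want to widen this data frame so that each individual admission is one row with all of its diagnoses in separate columns. I tried the spread function from tidyr, but it did not give the correct result. Any suggestions?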

ID   DX   Age   Admitted
1    a    17     3/2/14
1    b    17     3/2/14
1    c    17     4/30/14
2    e    20     7/22/13
2    a    20     7/22/13
2    c    20     7/22/13
2    d    20      2/4/14
3    b    16      4/18/14
4    e    16     10/8/13
4    m    16     10/8/13

The expected output is as follows:

ID   DX1   DX2   DX3   Age   Admitted
1    a     b      NA    17     3/2/14
1    c     NA     NA    17     4/30/14
2    e     a      c     20     7/22/13
2    d     NA     NA    20      2/4/14
3    b     NA     NA    16      4/18/14
4    e     m      NA    16     10/8/13

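1 answer:

Answer 0 (score: 0)

This may help: number the diagnoses within each ID/Admitted group, then cast to wide format with reshape2.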

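 # Number the diagnoses within each ID/Admitted group: DX1, DX2, ...
 df1$ind <- with(df1, paste0('DX', ave(seq_along(ID), 
                 ID, Admitted, FUN=seq_along)))
 library(reshape2)
 # Cast to wide format: one row per ID/Age/Admitted, one column per ind value
 dcast(df1, ...~ind, value.var='DX')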
 #    ID Age Admitted DX1  DX2  DX3
 #1  1  17   3/2/14   a    b <NA>
 #2  1  17  4/30/14   c <NA> <NA>
 #3  2  20   2/4/14   d <NA> <NA>
 #4  2  20  7/22/13   e    a    c
 #5  3  16  4/18/14   b <NA> <NA>
 #6  4  16  10/8/13   e    m <NA>
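The key step in the approach above is the within-group counter built by ave(). As a minimal sketch, assuming the same sample data as in the question (rebuilt here so the snippet runs on its own; the names counter and ind_check are only illustrative), the counter can be inspected separately before it is pasted onto the 'DX' prefix:

```r
# Same sample data as in the question (Age is omitted, it is not needed here).
df1 <- data.frame(
  ID = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 4L, 4L),
  DX = c("a", "b", "c", "e", "a", "c", "d", "b", "e", "m"),
  Admitted = c("3/2/14", "3/2/14", "4/30/14", "7/22/13", "7/22/13",
               "7/22/13", "2/4/14", "4/18/14", "10/8/13", "10/8/13"),
  stringsAsFactors = FALSE
)

# ave() applies FUN to the values of each ID/Admitted group and returns the
# results in the original row order; seq_along() gives 1, 2, 3, ... per group.
counter <- with(df1, ave(seq_along(ID), ID, Admitted, FUN = seq_along))

# Turn the counter into the future column names DX1, DX2, DX3.
ind_check <- paste0("DX", counter)

print(data.frame(df1, counter = counter, ind = ind_check))
```

Each admission restarts the counter at 1, which is what lets the later cast put the first diagnosis of every admission into DX1, the second into DX2, and so on.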

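Or, using dplyr and tidyr:

 library(dplyr)
 library(tidyr)
 df1 %>%
     group_by(ID, Admitted) %>%
     # number the diagnoses within each admission: DX1, DX2, ...
     mutate(ind=paste0('DX', 1:n())) %>%
     ungroup() %>% 
     # spread the DX values into one column per ind value
     spread(ind, DX)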
   #    ID Age Admitted DX1 DX2 DX3
   #1  1  17   3/2/14   a   b  NA
   #2  1  17  4/30/14   c  NA  NA
   #3  2  20   2/4/14   d  NA  NA
   #4  2  20  7/22/13   e   a   c
   #5  3  16  4/18/14   b  NA  NA
   #6  4  16  10/8/13   e   m  NA
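In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(). The following is a sketch of the same idea with that function, not part of the original answer; it assumes a recent dplyr and tidyr are installed, rebuilds the sample data so it runs on its own, and uses wide as an illustrative name:

```r
library(dplyr)
library(tidyr)

# Same sample data as in the question.
df1 <- data.frame(
  ID = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 4L, 4L),
  DX = c("a", "b", "c", "e", "a", "c", "d", "b", "e", "m"),
  Age = c(17L, 17L, 17L, 20L, 20L, 20L, 20L, 16L, 16L, 16L),
  Admitted = c("3/2/14", "3/2/14", "4/30/14", "7/22/13", "7/22/13",
               "7/22/13", "2/4/14", "4/18/14", "10/8/13", "10/8/13"),
  stringsAsFactors = FALSE
)

wide <- df1 %>%
  group_by(ID, Admitted) %>%
  # number the diagnoses within each admission: DX1, DX2, ...
  mutate(ind = paste0("DX", row_number())) %>%
  ungroup() %>%
  # one row per ID/Age/Admitted, one column per ind value
  pivot_wider(names_from = ind, values_from = DX)

print(wide)
```

Admissions with fewer than three diagnoses get NA in the unused DX columns, as in the spread() version.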

Data

df1 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 4L, 4L), 
DX = c("a", "b", "c", "e", "a", "c", "d", "b", "e", "m"), 
Age = c(17L, 17L, 17L, 20L, 20L, 20L, 20L, 16L, 16L, 16L), 
Admitted = c("3/2/14", "3/2/14", "4/30/14", "7/22/13", "7/22/13", 
"7/22/13", "2/4/14", "4/18/14", "10/8/13", "10/8/13")),
.Names =   c("ID", 
"DX", "Age", "Admitted"), class = "data.frame", row.names = c(NA, 
-10L))
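Note that Admitted is stored as character in this data, so both results above list it in alphabetical order within each ID ("2/4/14" comes before "7/22/13" for ID 2) rather than chronologically as in the expected output. A minimal sketch, assuming the dates are in month/day/two-digit-year form (the name df_dates is only illustrative), converts the column to Date before reshaping:

```r
# Same sample data as above.
df_dates <- data.frame(
  ID = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 4L, 4L),
  DX = c("a", "b", "c", "e", "a", "c", "d", "b", "e", "m"),
  Age = c(17L, 17L, 17L, 20L, 20L, 20L, 20L, 16L, 16L, 16L),
  Admitted = c("3/2/14", "3/2/14", "4/30/14", "7/22/13", "7/22/13",
               "7/22/13", "2/4/14", "4/18/14", "10/8/13", "10/8/13"),
  stringsAsFactors = FALSE
)

# Parse month/day/two-digit-year strings into Date values
# (assumed format; adjust the format string if the real data differs).
df_dates$Admitted <- as.Date(df_dates$Admitted, format = "%m/%d/%y")

# Sorting by ID and Admitted is now chronological within each ID.
df_dates <- df_dates[order(df_dates$ID, df_dates$Admitted), ]
print(df_dates)
```

With Admitted as a Date, the same dcast or spread call should then order the admissions of each ID chronologically.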