Question

I have a dataset (DF) similar to the following:

   ID DOB      Age Outcome    
   1  1/01/80  18     1
   1  1/01/80  18     0
   2  1/02/81  17     1
   2  1/02/81  17     0
   3  1/03/70  28     1

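I want to reshape the dataset to wide format so that each ID has a single row. However, since DOB and Age are identical within each ID, I want those variables to stay as single columns in the new dataset, with multiple columns only for the Outcome variable, like this: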

   ID DOB      Age Outcome.1 Outcome.2    
   1  1/01/80  18     1         0
   2  1/02/81  17     1         0
   3  1/03/70  28     1         NA

I have tried tidyr and reshape, but I cannot seem to get the dataset into this format. For example, when I run:

spread(DF, key=ID, value = Outcome)

I get an error message saying that I have duplicate identifiers for rows. Is there a way to get the dataset into the format I want?

Thanks.

Answer 1

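One solution uses the tidyverse with the steps below. The idea is to add the row number as a column so that every row has a unique identifier. After that, there are different ways to apply spread. Note that this version names the columns after the Outcome value (Outcome.0, Outcome.1) rather than after the position within each ID.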

df <- read.table(text = "ID DOB      Age Outcome    
1  1/01/80  18     1
1  1/01/80  18     0
2  1/02/81  17     1
2  1/02/81  17     0
3  1/03/70  28     1", header = T, stringsAsFactors = F)

library(tidyverse)

df %>% mutate(rownum = row_number(), Outcome = paste("Outcome",Outcome,sep=".")) %>%
  spread(Outcome, rownum) %>%
  mutate(Outcome.0 = ifelse(!is.na(Outcome.0),0, NA )) %>%
  mutate(Outcome.1 = ifelse(!is.na(Outcome.1),1, NA ))

# Result:
#  ID     DOB Age Outcome.0 Outcome.1
#1  1 1/01/80  18         0         1
#2  2 1/02/81  17         0         1
#3  3 1/03/70  28        NA         1

Answer 2

The dcast function is meant for exactly this kind of thing.

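library(reshape2)  # provides dcast(); data.table also has a dcast()

# One row per ID/DOB/Age, one column per distinct Outcome value
dcast(DF, ID + DOB + Age ~ Outcome)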
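
If the columns should be named by position within each ID (Outcome.1, Outcome.2) as in the question, rather than by the Outcome value, one option is to build a per-ID counter first and cast on that. The following is a minimal sketch, assuming reshape2 is installed; the column name OutcomeID and the object name wide are illustrative and not part of the original answer, and DF is rebuilt from the question's sample data so the snippet runs on its own.

```r
library(reshape2)

# Sample data from the question
DF <- read.table(text = "ID DOB Age Outcome
1 1/01/80 18 1
1 1/01/80 18 0
2 1/02/81 17 1
2 1/02/81 17 0
3 1/03/70 28 1", header = TRUE, stringsAsFactors = FALSE)

# Running counter within each ID (1, 2, ...), turned into a column label
DF$OutcomeID <- paste0("Outcome.", ave(DF$Outcome, DF$ID, FUN = seq_along))

# Cast on the counter and state the value column explicitly
wide <- dcast(DF, ID + DOB + Age ~ OutcomeID, value.var = "Outcome")
wide
```

Passing value.var explicitly avoids relying on dcast guessing which column holds the values.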

Answer 3

You can use tidyr and dplyr:

   DF %>%
      group_by(ID) %>%
      mutate(OutcomeID = paste0('Outcome.', row_number())) %>%
      spread(OutcomeID, Outcome)
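
In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(), so the same approach can be sketched with the newer function. This is only an illustration under that assumption: the object name wide is made up for the example, and DF is rebuilt from the question's sample data so the snippet is self-contained.

```r
library(dplyr)
library(tidyr)

# Sample data from the question
DF <- read.table(text = "ID DOB Age Outcome
1 1/01/80 18 1
1 1/01/80 18 0
2 1/02/81 17 1
2 1/02/81 17 0
3 1/03/70 28 1", header = TRUE, stringsAsFactors = FALSE)

wide <- DF %>%
  group_by(ID) %>%
  # Label rows by position within each ID: Outcome.1, Outcome.2, ...
  mutate(OutcomeID = paste0("Outcome.", row_number())) %>%
  ungroup() %>%
  # One column per label; IDs with fewer rows are filled with NA
  pivot_wider(names_from = OutcomeID, values_from = Outcome)

wide
```

Because the label comes from the row position rather than the Outcome value, this also works when an ID has the same Outcome value more than once.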
