R: long to wide format with factor levels as binary variables and dates

Time: 2018-11-30 09:45:58

Tags: r format reshape dcast spread

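I want to reshape my data from long to wide format and use the factor levels as binary variables: if a factor level occurs at least once for a patient, the corresponding variable should be 1, otherwise 0. In addition, I want the dates as values in separate columns date.1, date.2, ...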

What I have is the following:

data_sample <- data.frame(
  PatID  = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L),
  date   = c("2016-12-14", "2017-02-04", "NA", "NA", "2012-27-03", "2012-04-21", "2010-02-03", "2011-03-05", "2014-08-25"),
  status = c("COPD", "CPOD", "NA", "NA", "Cardio", "CPOD", "Cardio", "Cardio", "Cerebro")
)

What I want is:

PatID  COPD  Cardio  Cerebro  date.COPD.1  date.COPD.2  date.Cardio.1  date.Cardio.2  date.Cerebro.1
1      1     0       0        2016-12-14   2017-02-04   NA             NA             NA
2      0     1       0        NA           NA           2012-03-27     NA             NA
3      1     1       1        2012-04-21   NA           2010-02-03     2011-03-05     2014-08-25

1 Answer:

Answer 0 (score: 0)

It takes a few steps, but this should give you the desired output.

Note, however, that the input data seems to contain a typo: I assume you meant "COPD" rather than "CPOD", since that is what your expected output suggests.
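If you start from the question's data rather than the corrected data at the end of this answer, the typo could be fixed in code first. This is only a sketch on a standalone copy of the status values, and it assumes that every "CPOD" really is a misspelling of "COPD":

```r
# The question's status column, typos included
status <- c("COPD", "CPOD", "NA", "NA", "Cardio", "CPOD", "Cardio", "Cardio", "Cerebro")

# Assumption: "CPOD" is a misspelling of "COPD"
status[status == "CPOD"] <- "COPD"

# Frequency of each status after the correction
table(status)
```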

The first step is to turn the string "NA" into an explicit missing value, i.e. NA:

data_sample[data_sample == "NA"] <- NA
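To see what that replacement does, here is a small check on a toy data frame. The name `toy` and its contents are made up for illustration and are not part of the question's data:

```r
# Hypothetical toy data: the string "NA" is not a missing value yet
toy <- data.frame(
  date   = c("2016-12-14", "NA"),
  status = c("COPD", "NA"),
  stringsAsFactors = FALSE
)

n_before <- sum(is.na(toy))   # real missing values before the replacement

# toy == "NA" is a logical matrix; assigning NA through it replaces the matching cells
toy[toy == "NA"] <- NA

n_after <- sum(is.na(toy))    # real missing values after the replacement
colSums(is.na(toy))           # missing values per column
```

Without this step, `is.na(date)` in the later filters would not catch the "NA" strings.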

Now reshape with data.table::dcast.

library(data.table)  
setDT(data_sample)

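# number repeated statuses within each patient (1, 2, ...)
data_sample[, id := rowid(status), by = PatID]
# 0/1 indicator per status: 1 if the patient has at least one dated record with it
# (value.var is set explicitly; without it dcast guesses the last column, id)
dt1 <- dcast(data_sample[!is.na(date)], PatID ~ status,
             value.var = "id", fun.aggregate = function(x) +any(x))
# one date column per status and occurrence, e.g. date_COPD_1
dt2 <- dcast(data_sample[!is.na(date)], PatID ~ paste0("date_", status) + id, value.var = "date")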
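The `id` column is what lets the second `dcast` spread repeated statuses into separate date columns. A minimal illustration of `rowid()` on made-up data (the `toy` table is hypothetical):

```r
library(data.table)

# Hypothetical toy data: one patient with two COPD records and one Cardio record
toy <- data.table(
  PatID  = c(1L, 1L, 1L),
  status = c("COPD", "Cardio", "COPD")
)

# rowid() numbers repeated values in order of appearance, here within each patient
toy[, id := rowid(status), by = PatID]
toy$id   # 1 1 2
```

The second COPD record gets id 2, so it would end up in its own column (date_COPD_2) rather than colliding with the first.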

Finally, join the two data.tables:

out <- dt1[dt2, on = 'PatID']
out
#  PatID Cardio Cerebro COPD date_COPD_1 date_COPD_2 date_Cardio_1 date_Cardio_2 date_Cerebro_1
#1:     1      0       0    1  2016-12-14  2017-02-04          <NA>          <NA>           <NA>
#2:     2      1       0    0        <NA>        <NA>    2012-27-03          <NA>           <NA>
#3:     3      1       1    1  2012-04-21        <NA>    2010-02-03    2011-03-05     2014-08-25
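This result still differs from the layout requested in the question in two ways: the column names use underscores instead of dots, and the dates are plain text, including "2012-27-03", which does not parse as year-month-day. The sketch below shows one possible cleanup on standalone vectors. It assumes that any date failing the year-month-day format was written as year-day-month, which matches the 2012-03-27 shown in the question's expected output; the vectors `dates` and `nms` are illustrative excerpts, not the full result.

```r
# A few date strings as they appear in the output above
dates <- c("2016-12-14", "2012-27-03", NA)

# First try year-month-day; entries that do not fit come back as NA
parsed <- as.Date(dates, format = "%Y-%m-%d")

# Assumption: anything that failed was written as year-day-month
failed <- is.na(parsed) & !is.na(dates)
parsed[failed] <- as.Date(dates[failed], format = "%Y-%d-%m")

parsed

# Column names with dots instead of underscores, as in the question
nms <- c("PatID", "COPD", "date_COPD_1", "date_Cardio_2")
new_nms <- gsub("_", ".", nms, fixed = TRUE)
new_nms
```

On the real result, the same renaming could be applied with `setnames(out, gsub("_", ".", names(out), fixed = TRUE))`.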

Data

data_sample <- data.frame(
  PatID  = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L),
  date   = c("2016-12-14", "2017-02-04", "NA", "NA", "2012-27-03", "2012-04-21", "2010-02-03", "2011-03-05", "2014-08-25"),
  status = c("COPD", "COPD", "NA", "NA", "Cardio", "COPD", "Cardio", "Cardio", "Cerebro")
)