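I want to reshape my data from long to wide format and use the factor levels as binary variables: if a factor level occurs at least once for a patient, the corresponding variable should be 1, otherwise 0. In addition, I want the dates as variable values date.1, date.2, ...
What I have is the following: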
data_sample <- data.frame(
PatID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L),
date = c("2016-12-14", "2017-02-04", "NA", "NA", "2012-27-03", "2012-04-21", "2010-02-03", "2011-03-05", "2014-08-25"),
status = c("COPD", "CPOD", "NA", "NA", "Cardio", "CPOD", "Cardio", "Cardio", "Cerebro")
)
What I want is:
PatID COPD Cardio Cerebro date.COPD.1 date.COPD.2 date.Cardio.1 date.Cardio.2 date.Cerebro.1
1     1    0      0       2016-12-14  2017-02-04  NA            NA            NA
2     0    1      0       NA          NA          2012-03-27    NA            NA
3     1    1      1       2012-04-21  NA          2010-02-03    2011-03-05    2014-08-25
Answer 0 (score: 0)
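It takes a few steps, but this should give you the desired output.
Note, however, that the input data appears to contain a typo: I assume you meant "COPD" rather than "CPOD", since that is what your expected output suggests.
The first step is to turn the string "NA" into an explicit missing value, i.e. NA.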
data_sample[data_sample == "NA"] <- NA
Now reshape using data.table::dcast.
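library(data.table)
setDT(data_sample)
# number repeated occurrences of each status within a patient (1, 2, ...)
data_sample[, id := rowid(status), by = PatID]
# 0/1 indicator per status: +any(x) is 1 if the patient has at least one dated row, else 0
dt1 <- dcast(data_sample[!is.na(date)], PatID ~ status, value.var = "id", fun.aggregate = function(x) +any(x))
# one date column per status and occurrence number
dt2 <- dcast(data_sample[!is.na(date)], PatID ~ paste0("date_", status) + id, value.var = "date")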
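To make the two building blocks above less opaque, here is a minimal sketch on hypothetical toy data (the table `toy` and its columns `grp` and `lab` are made up for illustration and are not part of the question). It shows how rowid() numbers repeated occurrences and how the unary plus turns the result of any() into 0/1.

```r
library(data.table)

# Hypothetical toy data, not taken from the question
toy <- data.table(
  grp = c("a", "a", "a", "b"),
  lab = c("x", "x", "y", "x")
)

# rowid() numbers the occurrences of each (grp, lab) combination in row order
toy[, id := rowid(grp, lab)]

# Unary plus converts the logical returned by any() into an integer
has_some <- +any(c(1L, 2L))   # at least one non-zero value
has_none <- +any(integer(0))  # empty input, which is what dcast uses for missing cells
```

Passing both columns to rowid() should be equivalent to calling rowid(lab) with by = grp, which is the form used above.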
Finally, join the two data.tables:
out <- dt1[dt2, on = 'PatID']
out
# PatID Cardio Cerebro COPD date_COPD_1 date_COPD_2 date_Cardio_1 date_Cardio_2 date_Cerebro_1
#1: 1 0 0 1 2016-12-14 2017-02-04 <NA> <NA> <NA>
#2: 2 1 0 0 <NA> <NA> 2012-27-03 <NA> <NA>
#3: 3 1 1 1 2012-04-21 <NA> 2010-02-03 2011-03-05 2014-08-25
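In case the join syntax is unfamiliar: X[Y, on = ...] looks up each row of Y in X, so the result has one row per row of Y. A minimal sketch with hypothetical tables (`left`, `right` and their columns are made up for illustration):

```r
library(data.table)

# Hypothetical tables, not taken from the question
left  <- data.table(k = 1:3, a = c("p", "q", "r"))
right <- data.table(k = c(2L, 3L, 4L), b = c(TRUE, FALSE, TRUE))

# Keeps every row of `right`; columns coming from `left` are NA where no match exists
res <- left[right, on = "k"]
```

Here k = 1 is dropped because it does not occur in `right`, while k = 4 is kept with a missing `a`. In the answer above both tables are built from the same filtered rows, so every PatID should appear in both and no rows should be lost.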
Data
data_sample <- data.frame(
PatID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L),
date = c("2016-12-14", "2017-02-04", "NA", "NA", "2012-27-03", "2012-04-21", "2010-02-03", "2011-03-05", "2014-08-25"),
status = c("COPD", "COPD", "NA", "NA", "Cardio", "COPD", "Cardio", "Cardio", "Cerebro"))