Repeating rows in a data frame based on date

Time: 2017-02-04 15:07:26

Tags: r function dataframe duplicates apply

I'm looking for a simple solution to the following problem. This is what my data looks like:

ClientID  PatientID  Measure      Value  CollectionDatetime
41        123456     Temperature     87  02-04-2017
41        123456     WBC           1000  02-04-2017
41        123456     Temperature     83  02-05-2017
41        23456      WBC          10000  02-04-2017
41        23456      RR             100  02-04-2017
41        23456      C-Ceratine      90  02-05-2017
41        23456      Temperature     87  02-06-2017
41        23456      Temperature     89  02-06-2017

This is how I want the output to look:

ClientID  PatientID  Measure      Value  CollectionDatetime  Label
41        123456     Temperature     87  02-04-2017          1
41        123456     WBC           1000  02-04-2017          1
41        123456     Temperature     87  02-04-2017          2
41        123456     WBC           1000  02-04-2017          2
41        123456     Temperature     83  02-05-2017          2
41        23456      WBC          10000  02-04-2017          1
41        23456      RR             100  02-04-2017          1
41        23456      WBC          10000  02-04-2017          2
41        23456      RR             100  02-04-2017          2
41        23456      C-Ceratine      90  02-05-2017          2
41        23456      WBC          10000  02-04-2017          3
41        23456      RR             100  02-04-2017          3
41        23456      C-Ceratine      90  02-05-2017          3
41        23456      Temperature     87  02-06-2017          3
41        23456      Temperature     89  02-06-2017          3

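The data should be replicated based on PatientID and CollectionDatetime. For each PatientID, label 1 holds the day 1 data, label 2 should hold the data for day 1 and day 2, and so on.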

1 answer:

Answer 0 (score: 0)

Using the data.table package:

# load the data.table package & convert 'dat' to a data.table
library(data.table)
setDT(dat)

# create the 'lbl' variable and the number of times each row needs to be repeated

dat[, lbl := rleid(CollectionDatetime), PatientID
    ][, reps := abs(lbl - max(lbl)), PatientID]
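
To make the two helper columns concrete, here is a minimal base R sketch of the values they take on the sample data. 'run_id' is a hypothetical helper that mimics what rleid does on a single vector, and 'patient' and 'day' are illustrative stand-ins for the PatientID and CollectionDatetime columns.

```r
# Hypothetical helper: number consecutive runs of equal values (1, 2, 3, ...)
run_id <- function(x) {
  r <- rle(x)
  rep(seq_along(r$lengths), r$lengths)
}

# Stand-ins for the PatientID and CollectionDatetime columns of the sample data
patient <- c(123456, 123456, 123456, 23456, 23456, 23456, 23456, 23456)
day <- as.Date(c("2017-02-04", "2017-02-04", "2017-02-05", "2017-02-04",
                 "2017-02-04", "2017-02-05", "2017-02-06", "2017-02-06"))

# Day label within each patient
lbl <- ave(as.numeric(day), patient, FUN = run_id)

# Extra copies each row needs: distance from its label to the patient's last label
reps <- ave(lbl, patient, FUN = function(x) max(x) - x)

lbl   # 1 1 2 1 1 2 3 3
reps  # 1 1 0 2 2 1 0 0
```

Because the labels count runs of equal dates, this assumes the rows are already sorted by date within each patient, as they are in the sample data.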

# create a 2nd data.table with the repeated rows
# make a sequence for each replication
# add that to 'lbl' to get correct 'lbl'

d2 <- dat[rep(1:nrow(dat), reps)][, lbl := lbl + 1:max(reps), .(PatientID,lbl)]

# bind the original data.table and the new together
# remove 'reps' column (no longer needed)
# and order to match the expected output

rbindlist(list(dat,d2))[, reps := NULL][order(-PatientID,lbl,CollectionDatetime)]

which gives:

    ClientID PatientID     Measure Value CollectionDatetime lbl
 1:       41    123456 Temperature    87         2017-02-04   1
 2:       41    123456         WBC  1000         2017-02-04   1
 3:       41    123456 Temperature    87         2017-02-04   2
 4:       41    123456         WBC  1000         2017-02-04   2
 5:       41    123456 Temperature    83         2017-02-05   2
 6:       41     23456         WBC 10000         2017-02-04   1
 7:       41     23456          RR   100         2017-02-04   1
 8:       41     23456         WBC 10000         2017-02-04   2
 9:       41     23456          RR   100         2017-02-04   2
10:       41     23456  C-Ceratine    90         2017-02-05   2
11:       41     23456         WBC 10000         2017-02-04   3
12:       41     23456          RR   100         2017-02-04   3
13:       41     23456  C-Ceratine    90         2017-02-05   3
14:       41     23456 Temperature    87         2017-02-06   3
15:       41     23456 Temperature    89         2017-02-06   3

You can achieve the same in base R:

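# label each distinct date per patient (assumes rows are sorted by date within patient)
dat$lbl <- with(dat, ave(as.numeric(CollectionDatetime), PatientID, FUN = function(x) cumsum(c(1, diff(x) > 0))))
# number of extra copies each row needs: distance to the patient's last label
dat$reps <- with(dat, ave(lbl, PatientID, FUN = function(x) abs(x - max(x))))

# repeat the rows, then shift 'lbl' by 1..reps within each run of equal 'reps'
dat2 <- dat[rep(1:nrow(dat), dat$reps),]
dat2$lbl <- dat2$lbl + with(dat2, ave(reps, cumsum(c(0,abs(diff(dat2$reps)))), FUN = function(x) 1:max(x)))

# bind, drop the 7th column ('reps') and order to match the expected output
d <- rbind(dat,dat2)[,-7]
d[order(-d$PatientID,d$lbl,d$CollectionDatetime),]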
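
The rule from the question (label k holds every row collected on or before the patient's k-th distinct date) can also be written down directly. The following is only a sketch of that reading, not the method used above; 'cumulate_days' is a hypothetical helper, and the data frame is rebuilt inline with Measure as character rather than factor so that the block runs on its own.

```r
# Sample data from the question, rebuilt so the block is self-contained
dat <- data.frame(
  ClientID = 41L,
  PatientID = c(123456L, 123456L, 123456L, 23456L, 23456L, 23456L, 23456L, 23456L),
  Measure = c("Temperature", "WBC", "Temperature", "WBC", "RR",
              "C-Ceratine", "Temperature", "Temperature"),
  Value = c(87L, 1000L, 83L, 10000L, 100L, 90L, 87L, 89L),
  CollectionDatetime = as.Date(c("2017-02-04", "2017-02-04", "2017-02-05", "2017-02-04",
                                 "2017-02-04", "2017-02-05", "2017-02-06", "2017-02-06"))
)

# Hypothetical helper: for one patient, stack the rows up to each distinct date
cumulate_days <- function(p) {
  days <- sort(unique(p$CollectionDatetime))
  pieces <- lapply(seq_along(days), function(k) {
    block <- p[p$CollectionDatetime <= days[k], , drop = FALSE]
    block$lbl <- k  # label = position of the date among the patient's dates
    block
  })
  do.call(rbind, pieces)
}

# Apply per patient, bind the pieces, and order as in the expected output
res <- do.call(rbind, lapply(split(dat, dat$PatientID), cumulate_days))
rownames(res) <- NULL
res <- res[order(-res$PatientID, res$lbl, res$CollectionDatetime), ]
res
```

Since each block is selected by comparing dates rather than by counting runs, this version does not rely on the rows being sorted beforehand. It builds one block per distinct date, so it is likely slower than the vectorised approaches on large data.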

Used data:

dat <- structure(list(ClientID = c(41L, 41L, 41L, 41L, 41L, 41L, 41L, 41L), 
                      PatientID = c(123456L, 123456L, 123456L, 23456L, 23456L, 23456L, 23456L, 23456L), 
                      Measure = structure(c(3L, 4L, 3L, 4L, 2L, 1L, 3L, 3L), .Label = c("C-Ceratine", "RR", "Temperature", "WBC"), class = "factor"), 
                      Value = c(87L, 1000L, 83L, 10000L, 100L, 90L, 87L, 89L), 
                      CollectionDatetime = structure(c(17201, 17201, 17202, 17201, 17201, 17202, 17203, 17203), class = "Date")), 
                 .Names = c("ClientID", "PatientID", "Measure", "Value", "CollectionDatetime"), row.names = c(NA, -8L), class = "data.frame")
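
The question shows the dates as month-day-year text, while the data above stores CollectionDatetime as a Date. If your column is still character, a minimal sketch of converting it first ('raw' and 'parsed' are illustrative names):

```r
# Dates as they appear in the question (month-day-year text)
raw <- c("02-04-2017", "02-05-2017", "02-06-2017")

# Parse with an explicit format so the month and day are not swapped
parsed <- as.Date(raw, format = "%m-%d-%Y")

as.numeric(parsed)  # 17201 17202 17203, the values stored in the data above
```

Converting before labelling matters because text in this format does not sort chronologically once the data spans more than one year.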