如何在time滞后定义的data.table行之间搜索条件?

时间:2015-12-12 17:00:36

标签: r data.table

我有处方记录的数据集。每行是特定日期单一药物的处方。我将药物分成两组,部分重叠。我想确定两个药物组在相隔3天内处方的处方,但不包括从第1组和第2组发出相同药物的地方,以确定该药物的后一种药物的日期。

我的数据示例:

library(data.table)
set.seed(10)
DT <- data.table(day = sample(c(1:31), 30, replace = TRUE),
                 drug_group = sample(c(1, 2), 30, replace = TRUE),
                 drug_1 = sample(c("A", "B", "C"), 30, replace = TRUE),
                 drug_2 = sample(c("A", "D", "E"), 30, replace = TRUE))
DT[drug_group == 1, drug := drug_1]
DT[drug_group == 2, drug := drug_2]
DT[, c("drug_1", "drug_2") := NULL]
setkey(DT, day)

以下内容:

    day drug_group drug
 1:   2          1    B
 2:   3          1    C
 3:   4          1    B
 4:   7          2    E
 5:   8          1    A
 6:   9          2    A
 7:   9          2    D
 8:   9          1    C
 9:  10          1    A
10:  12          1    A
...
24:  22          2    D
25:  22          2    E
26:  24          1    A
27:  25          1    A
28:  26          2    D
29:  26          1    C
30:  27          1    C

我想得到这样的结果:

    day interaction_present
 1:   1       FALSE
 2:   2       FALSE
 3:   3       FALSE
 4:   4       FALSE
 5:   5       FALSE
 6:   6       FALSE
 ...
26:  26       TRUE
29:  29       FALSE
30:  30       FALSE

我很确定我可以通过循环遍历每一行来做到这一点,但我已经被反复告诫使用循环而不是矢量化,我想知道这种类型的任务是否可行而没有循环?我已经看过使用data.table shift()函数来设置滞后但我担心创建太多新列,因为我的实际data.table超过一百万行。

很抱歉,如果这是一个微不足道的问题,或者之前有人问过,但是我整个下午都被困在了这一天,我就放弃了这一天!

1 个答案:

答案 0 :(得分:-1)

如果我正确理解您的问题,以下内容应该有效。可以轻松地对函数内的逻辑决策以及时间滞后startend变量进行任何更改。

timelagadj <- function(i){

  ## this should be changed depending on what you mean by "within"
  ## 3 days. This currently goes i-3,i+3
  ## but if i-3 or i+3 doesn't exist in dayDT$day
  ## then we pull the value one above/below i-3/i+3 respectively
  start <- max(dayDT$day[i]-3,dayDT$day[1])
  start <- ifelse(dayDT$day[findInterval(start,dayDT$day)] < start,
            findInterval(start,dayDT$day)+1,
              findInterval(start,dayDT$day))
  end <- min(dayDT$day[i]+3,dayDT$day[nrow(dayDT)])
  end <- findInterval(end,dayDT$day)

  ## now we pull the relevant group ID and drug ID
  gIDs <- dayDT$groupID[start:end]
  dIDs <- dayDT$drugID[start:end]

  ## here we unlist the paste made before
  ## to group by days
  gIDs <- unlist(strsplit(gIDs,"_"))
  dIDs <- unlist(strsplit(dIDs,"_"))

  ## now we can apply our logic rule based
  ## on the criteria you mentioned

  if(length(unique(gIDs))>1){
    tmp <- unique(data.frame(gIDs,dIDs))
    if(length(unique(tmp$gIDs))!=length(unique(tmp$dIDs))) T else F

  }else F

}

dayDT <- DT[,list("drugID"=paste(drug,collapse="_"),"groupID"=paste(drug_group,collapse="_")),by=day]
res <- sapply(1:nrow(dayDT),function(m) timelagadj(m))
res <- dayDT[,list(day,"interaction_present"=res)]