R: Assigning a variable in data.table with a logical statement in "()" inside the ifelse function

Time: 2015-03-21 15:20:37

Tags: r data.table

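In the question "Defining variable by logical subseting on time interval in data.table" I asked for help assigning a "state" variable based on the time stamps between events (i.e. event==1 and event==2).

That solution uses the ifelse function, where the logical test checks whether the time variable lies between the time values of the start point and the end point.

The question now is what to do if I want to group logical statements inside the ifelse function, so that an OR statement is evaluated first and an AND statement afterwards. For concreteness, I have the following data.table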

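# Load data.table, then define the variables and the data.table
library(data.table)
id <- rep(LETTERS[1:3],each=5)
set.seed(123)
event <- c(sample(c(0,1),2,FALSE),sample(c(0,0,2),3,FALSE),
           sample(c(0,1),2,FALSE),sample(c(0,0,2),3,FALSE),
           sample(c(0,1),2,FALSE),sample(c(0,0,2),3,FALSE))
# Randomly relabel the three end events as 2 or 3
event[event==2] <- sample(c(2,3),3,TRUE)
state <- "NULL"
# Cumulative random times within each id
time <- c(apply(matrix(runif(3*5),5,3),2,cumsum))
DT <- data.table(id,event,state,time)
# Copy row 13 into row 14 so that events 2 and 3 share the same time point
DT[14,] <- DT[13,]
DT[14,event:=3]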

which produces this data.table

    id event state      time
 1:  A     0  NULL 0.3279207
 2:  A     1  NULL 1.2824244
 3:  A     0  NULL 2.1719637
 4:  A     3  NULL 2.8647671  <- Event 2 or 3 marks the end point
 5:  A     0  NULL 3.5052739
 6:  B     0  NULL 0.9942698
 7:  B     1  NULL 1.6499756
 8:  B     2  NULL 2.3585060  <- Event 2 or 3 marks the end point
 9:  B     0  NULL 2.9025721
10:  B     0  NULL 3.4967141
11:  C     1  NULL 0.2891597
12:  C     0  NULL 0.4362734
13:  C     2  NULL 1.3992976  <- Here both 2 and 3 appear at the same endpoint 
14:  C     3  NULL 1.3992976  <- Here both 2 and 3 appear at the same endpoint 
15:  C     0  NULL 2.9923019

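I want to assign the value 1 to the state variable for all observations between the start event (event==1) and the end point (event==2, event==3, or both). So the correct result looks like this: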

    id event state      time
 1:  A     0  NULL 0.3279207
 2:  A     1     1 1.2824244
 3:  A     0     1 2.1719637
 4:  A     3     1 2.8647671
 5:  A     0  NULL 3.5052739
 6:  B     0  NULL 0.9942698
 7:  B     1     1 1.6499756
 8:  B     2     1 2.3585060
 9:  B     0  NULL 2.9025721
10:  B     0  NULL 3.4967141
11:  C     1     1 0.2891597
12:  C     0     1 0.4362734
13:  C     2     1 1.3992976
14:  C     3     1 1.3992976
15:  C     0  NULL 2.9923019

My first attempt was this code:

DT[,state:=ifelse(time>=time[event==1] & (time<=time[event==2] | time<=time[event==3]),1,state),by=id]

which gives the following error message:

Error in `[.data.table`(DT, , `:=`(state, ifelse(time >= time[event ==  : 
Type of RHS ('logical') must match LHS ('character'). To check and coerce would 
impact performance too much for the fastest cases. Either change the type of the target 
column, or coerce the RHS of := yourself (e.g. by using 1L instead of 1)

This line of code produces the correct result

DT[,state:=ifelse(time>=time[event==1] & time<=time[event==2 | event==3],1,state),by=id]

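but it produces a warning whenever time[event==2 | event==3] in the logical statement time<=time[event==2 | event==3] has length greater than 1, as it does for id C. So it is not an elegant solution, because it looks like a mistake.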
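To make the source of that warning concrete, here is a minimal base-R sketch using the values of group C from the table above. The times are rounded, plain vectors stand in for the data.table columns, and the names end_times and msg are only illustrative.

```r
# Group C from the example, times rounded for readability
time  <- c(0.289, 0.436, 1.399, 1.399, 2.992)
event <- c(1, 0, 2, 3, 0)

# Two rows match, so the right-hand side of the comparison has length 2
end_times <- time[event == 2 | event == 3]
length(end_times)

# Comparing 5 values against 2 recycles the shorter vector and warns,
# because 5 is not a multiple of 2; capture the warning text instead of printing it
msg <- tryCatch(
  { time <= end_times; NA_character_ },
  warning = function(w) conditionMessage(w)
)
msg
```

In this data both end times are identical, so the recycled comparison happens to give the intended result, but that would not hold in general if the two end events had different times.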

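How can I assign the value 1 to the state variable when the time lies between the start point and the end point, where the end point is defined by an OR statement as in my first attempt?

Many thanks.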

3 answers:

Answer 0: (score: 3)

I'm not familiar with data.table, so there may be a better way.

DT[, rows:=1:.N , by=id][
   , state:=ifelse(rows >= which(event==1) & rows <= max(which(event==2), which(event==3)), 1, state), by=id]
DT
    id event state      time rows
 1:  A     0  NULL 0.3279207    1
 2:  A     1     1 1.2824244    2
 3:  A     0     1 2.1719637    3
 4:  A     3     1 2.8647671    4
 5:  A     0  NULL 3.5052739    5
 6:  B     0  NULL 0.9942698    1
 7:  B     1     1 1.6499756    2
 8:  B     2     1 2.3585060    3
 9:  B     0  NULL 2.9025721    4
10:  B     0  NULL 3.4967141    5
11:  C     1     1 0.2891597    1
12:  C     0     1 0.4362734    2
13:  C     2     1 1.3992976    3
14:  C     3     1 1.3992976    4
15:  C     0  NULL 2.9923019    5

Answer 1: (score: 3)

The first attempt fails because time[event==2] or time[event==3] evaluates to numeric(0) when only one of the two events actually occurs

DT[id=='A', time[event==2]]
## numeric(0)
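As a rough illustration of why this leads to the "logical" type in the error message, the base-R sketch below traces the first attempt for group A. The times are rounded, plain vectors stand in for the data.table columns, and the intermediate names (cmp_event_2, cmp_event_3, test, out) are made up for this example.

```r
# Group A from the example, times rounded for readability
time  <- c(0.328, 1.282, 2.172, 2.865, 3.505)
event <- c(0, 1, 0, 3, 0)
state <- rep("NULL", 5)

cmp_event_2 <- time <= time[event == 2]  # logical(0): event 2 never occurs in group A
cmp_event_3 <- time <= time[event == 3]  # length 5

# A zero-length operand makes the result of | and & zero-length as well
test <- time >= time[event == 1] & (cmp_event_2 | cmp_event_3)
length(test)

# ifelse() returns a result shaped like its test, so the output is an empty logical vector
out <- ifelse(test, 1, state)
typeof(out)
```

The empty logical result is what then collides with the character column state in the := assignment.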

The simplest way to fix this is to take, for example, the maximum of the two times: time <= max(time[event %in% 2:3])

DT[, state := ifelse(time >= time[event==1] & time <= max(time[event %in% 2:3]), 1, state), by=id]
DT
##     id event state      time
##  1:  A     0  NULL 0.3279207
##  2:  A     1     1 1.2824244
##  3:  A     0     1 2.1719637
##  4:  A     3     1 2.8647671
##  5:  A     0  NULL 3.5052739
##  6:  B     0  NULL 0.9942698
##  7:  B     1     1 1.6499756
##  8:  B     2     1 2.3585060
##  9:  B     0  NULL 2.9025721
## 10:  B     0  NULL 3.4967141
## 11:  C     1     1 0.2891597
## 12:  C     0     1 0.4362734
## 13:  C     2     1 1.3992976
## 14:  C     3     1 1.3992976
## 15:  C     0  NULL 2.9923019
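The same idea can be wrapped in a small helper that also guards against groups with no start or no end event. This is only a sketch: in_window is a hypothetical name, the guard and the use of min() for the start time are assumptions that go beyond the answer above, and the times are rounded values from the example.

```r
# Hypothetical helper: TRUE for rows between the start event and the latest end event
in_window <- function(time, event, start = 1, end = c(2, 3)) {
  t_start <- time[event == start]
  t_end   <- time[event %in% end]
  # Assumption: a group without a start or without an end event has no window
  if (length(t_start) == 0 || length(t_end) == 0) {
    return(rep(FALSE, length(time)))
  }
  time >= min(t_start) & time <= max(t_end)
}

# Group A: only event 3 marks the end point
res_A <- in_window(c(0.328, 1.282, 2.172, 2.865, 3.505), c(0, 1, 0, 3, 0))
# Group C: events 2 and 3 share the same end time
res_C <- in_window(c(0.289, 0.436, 1.399, 1.399, 2.992), c(1, 0, 2, 3, 0))
res_A
res_C
```

Inside data.table, such a helper could presumably be called per group in the same way as the expression above, for instance as the test argument of ifelse with by=id.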

Answer 2: (score: 1)

You can solve it by defining two new columns.

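# Number the segments: a new segment starts at every start event (event == 1)
DT[, segment := cumsum(event == 1)]
# Within each segment, keep rows up to and including the first end event (2 or 3)
DT[, keep := cumsum(c(1, event[-.N]) %in% c(2, 3)) < 1, by = segment]
# Rows before the first start event belong to no segment
DT[segment == 0, keep := FALSE]
DT[keep == TRUE, state := 1]
# Drop the helper columns
DT[, segment := NULL]
DT[, keep := NULL]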
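The cumulative-sum logic behind these two helper columns can be followed on a plain vector. The base-R sketch below uses the events of ids A and B from the example; the names is_end and n_prev_end are illustrative, and ave() stands in for the by = segment grouping.

```r
# Events of ids A and B from the example, in row order
event <- c(0, 1, 0, 3, 0, 0, 1, 2, 0, 0)

# A new segment starts at every start event; segment 0 precedes the first start
segment <- cumsum(event == 1)

# Count the end events strictly before each row, separately per segment
is_end     <- as.numeric(event %in% c(2, 3))
n_prev_end <- ave(is_end, segment, FUN = function(x) cumsum(x) - x)

# Keep rows inside a segment that have no earlier end event
keep <- segment > 0 & n_prev_end == 0
which(keep)
```

For this input the kept rows are 2 to 4 and 7 to 8, which matches the rows with state 1 for ids A and B in the expected result.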