I have two large datasets like these:
df1=data.frame(subject = c(rep(1, 12), rep(2, 10)), day =c(1,1,1,1,1,2,3,15,15,15,15,19,1,1,1,1,2,3,15,15,15,15),stime=c('4/16/2012 6:25','4/16/2012 7:01','4/16/2012 17:22','4/16/2012 17:45','4/16/2012 18:13','4/18/2012 6:50','4/19/2012 6:55','5/1/2012 6:28','5/1/2012 7:00','5/1/2012 16:28','5/1/2012 17:00','5/5/2012 17:00','4/23/2012 5:56','4/23/2012 6:30','4/23/2012 16:55','4/23/2012 17:20','4/25/2012 6:32','4/26/2012 6:28','5/8/2012 5:54','5/8/2012 6:30','5/8/2012 15:55','5/8/2012 16:30'))
df2=data.frame(subject = c(rep(1, 10), rep(2, 10)), day=c(1,1,2,2,3,3,9,9,15,15,1,1,2,2,3,3,9,9,15,15),dtime=c('4/16/2012 6:15','4/16/2012 15:16','4/18/2012 7:15','4/18/2012 21:45','4/19/2012 7:05','4/19/2012 23:17','4/28/2012 7:15','4/28/2012 21:12','5/1/2012 7:15','5/1/2012 15:15','4/23/2012 6:45','4/23/2012 16:45','4/25/2012 6:45','4/25/2012 21:30','4/26/2012 6:45','4/26/2012 22:00','5/2/2012 7:00','5/2/2012 22:00','5/8/2012 6:45','5/8/2012 15:45'))
...
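In df2, 'dtime' contains two time points for each day. For each subject and day in df1, I want to take the time point in df1 (i.e. 'stime') and subtract the second time point for that subject and day in df2. If the result is positive, that observation should get the second time point in dtime; otherwise it should get the first. For example, for subject 1 on day 1, ('4/16/2012 6:25' - '4/16/2012 15:16') < 0, so this observation gets the first time point, '4/16/2012 6:15'; ('4/16/2012 17:22' - '4/16/2012 15:16') > 0, so this observation gets the second time point, '4/16/2012 15:16'. The expected output should look like this: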
df3=data.frame(subject = c(rep(1, 12), rep(2, 10)), day =c(1,1,1,1,1,2,3,15,15,15,15,19,1,1,1,1,2,3,15,15,15,15),stime=c('4/16/2012 6:25','4/16/2012 7:01','4/16/2012 17:22','4/16/2012 17:45','4/16/2012 18:13','4/18/2012 6:50','4/19/2012 6:55','5/1/2012 6:28','5/1/2012 7:00','5/1/2012 16:28','5/1/2012 17:00','5/5/2012 17:00','4/23/2012 5:56','4/23/2012 6:30','4/23/2012 16:55','4/23/2012 17:20','4/25/2012 6:32','4/26/2012 6:28','5/8/2012 5:54','5/8/2012 6:30','5/8/2012 15:55','5/8/2012 16:30'), dtime=c('4/16/2012 6:15','4/16/2012 6:15','4/16/2012 15:16','4/16/2012 15:16','4/16/2012 15:16','4/18/2012 7:15','4/19/2012 7:05','5/1/2012 7:15','5/1/2012 7:15','5/1/2012 15:15','5/1/2012 15:15','.','4/23/2012 6:45','4/23/2012 6:45','4/23/2012 16:45','4/23/2012 16:45','4/25/2012 6:45','4/26/2012 6:45','5/8/2012 6:45','5/8/2012 6:45','5/8/2012 15:45','5/8/2012 15:45'))
...
I used the code below to do this. However, because 'dtime' is missing for day 19, R keeps giving me an error:
df1$dtime <- apply(df1, 1, function(x){
choices <- df2[ df2$subject==as.numeric(x["subject"]) &
df2$day==as.numeric(x["day"]) , "dtime"]
if( as.POSIXct(x["stime"], format="%m/%d/%Y %H:%M") <
as.POSIXct(choices[2],format="%m/%d/%Y %H:%M") ) {
choices[1]
}else{ choices[2] }
} )
Error in if (as.POSIXct(x["stime"], format = "%m/%d/%Y %H:%M") < as.POSIXct(choices[2], : missing value where TRUE/FALSE needed
Because my datasets are large (about 15,000 rows and 30 columns), there are some missing 'dtime' values in df2. Does anyone know how to solve this?
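The error can be reproduced in isolation, which may help narrow it down. This is a minimal sketch, assuming the day-19 lookup in df2 returns no rows: indexing past the end of the empty result gives NA, the parsed time is NA, and `if` cannot branch on an NA comparison. The `tryCatch` wrapper is only there to capture the message for illustration.

```r
# Subsetting df2 for a subject/day with no rows gives a zero-length vector.
choices <- character(0)

# Indexing past the end yields NA, and parsing NA yields an NA date-time.
t2 <- as.POSIXct(choices[2], format = "%m/%d/%Y %H:%M")
t1 <- as.POSIXct("5/5/2012 17:00", format = "%m/%d/%Y %H:%M")

cmp <- t1 < t2   # NA, neither TRUE nor FALSE

# `if` refuses an NA condition; tryCatch captures the error message
# instead of stopping the script.
msg <- tryCatch(if (cmp) "first" else "second",
                error = function(e) conditionMessage(e))
print(msg)
```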
Answer 0 (score: 0)
I think this might work for you:
df1$dtime <- apply(df1, 1, function(x) {
  # dtime values in df2 for this row's subject and day (empty if none)
  choices <- as.character(df2[df2$subject == as.numeric(x["subject"]) &
                              df2$day == as.numeric(x["day"]), "dtime"])
  t1 <- as.POSIXct(as.character(x["stime"]), format = "%m/%d/%Y %H:%M")
  t2 <- as.POSIXct(choices[2], format = "%m/%d/%Y %H:%M")
  # ifelse() returns NA when the comparison is NA instead of raising an error
  return(ifelse(t1 < t2, choices[1], choices[2]))
})
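If you would rather handle the missing case explicitly than rely on NA passing through the comparison, something like the following might also work. It is only a sketch on a three-row subset of the question's data: `pick_dtime` and `fmt` are hypothetical names introduced here, and `mapply` is swapped in for `apply` so that each column keeps its own type instead of being coerced to a character matrix.

```r
fmt <- "%m/%d/%Y %H:%M"

# Small illustrative subset of the question's data (subject 1 only).
df1 <- data.frame(subject = c(1, 1, 1),
                  day     = c(1, 1, 19),
                  stime   = c("4/16/2012 6:25", "4/16/2012 17:22",
                              "5/5/2012 17:00"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(subject = c(1, 1),
                  day     = c(1, 1),
                  dtime   = c("4/16/2012 6:15", "4/16/2012 15:16"),
                  stringsAsFactors = FALSE)

# pick_dtime is a hypothetical helper name, not part of the answer above.
pick_dtime <- function(subject, day, stime) {
  choices <- as.character(df2$dtime[df2$subject == subject & df2$day == day])
  # No complete pair of time points for this subject/day: report missing.
  if (length(choices) < 2 || anyNA(choices[1:2])) return(NA_character_)
  t1 <- as.POSIXct(stime, format = fmt)
  t2 <- as.POSIXct(choices[2], format = fmt)
  # Unparseable times are also treated as missing.
  if (is.na(t1) || is.na(t2)) return(NA_character_)
  if (t1 < t2) choices[1] else choices[2]
}

df1$dtime <- mapply(pick_dtime, df1$subject, df1$day, df1$stime,
                    USE.NAMES = FALSE)
print(df1)
```

Rows with no match come back as NA. If you need the '.' placeholder from your expected output, `df1$dtime[is.na(df1$dtime)] <- "."` afterwards should do it.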