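I have data frames in R with 50 columns and up to 2.5 million rows, representing time series. The time column is of class POSIXct. For my analysis, I repeatedly need to find the state of the system for a given class at a specific time.
My current approach is the following (simplified and reproducible):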
set.seed(1)
N <- 10000
.time <- sort(sample(1:(100*N),N))
class(.time) <- c("POSIXct", "POSIXt")
df <- data.frame(
  time=.time,
  distance1=sort(sample(1:(100*N),N)),
  distance2=sort(sample(1:(100*N),N)),
  letter=sample(letters,N,replace=TRUE)
)
# state search function
time.state <- function(df,searchtime,searchclass){
  # find all rows in between the searchtime and a while (here 10k seconds)
  # before that
  rows <- which(findInterval(df$time,c(searchtime-10000,searchtime))==1)
  # find the latest state of the given class within the search interval
  # (returns NA if no row of that class lies in the interval)
  return(rev(rows)[match(TRUE,rev(df[rows,"letter"]==searchclass))])
}
# evaluate the function to retrieve the latest known state of the system
# at time 500,000.
df[time.state(df,500000,"a"),]
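However, the call to which is very expensive. Alternatively, I could filter by class first and then search by time, but that does not change the evaluation time. According to Rprof, which and == account for most of the time spent.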
Is there a more efficient solution? The time points are sorted in weakly increasing order.
Answer 0 (score: 1)
Since which, == and [ all scale linearly with the size of the data frame, the solution is to build subset data frames for batches of queries, as follows:
# function that applies time.state to a series of time/class combinations
# (assumes times is sorted in increasing order)
time.states <- function(df,times,classes,day.length=24){
  result <- vector("list",length(times))
  day.end <- 0
  for(i in seq_along(times)){
    if(times[i] > day.end){
      # create subset interval from 1h before to 24h after; the look-back
      # should be at least as long as the search window used in time.state
      # (10k seconds in the simplified example), or earlier states are missed
      day.begin <- times[i]-60*60
      day.end <- times[i]+day.length*60*60
      df.subset <- df[findInterval(df$time,c(day.begin,day.end))==1,]
    }
    # save the resulting row from data frame
    result[[i]] <- df.subset[time.state(df.subset,times[i],classes[i]),]
  }
  return(do.call("rbind",result))
}
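With dT = diff(range(df$time)) expressed in hours and dT/day.length large, this reduces the evaluation time by a factor of roughly dT/(day.length+1), because each query then scans only a subset covering day.length+1 hours instead of the whole data frame.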
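Because the question states that the time column is sorted, the window bounds for many queries can also be located by binary search instead of comparing every row. The following is only a sketch of that alternative, not the batching method described above: the name time.states.fast and the lookback parameter are made up for illustration, and the sample data are copied from the question. It relies on findInterval being vectorized over its first argument and on left.open = TRUE (available since R 3.3.0) to count the rows strictly before a given time.

```r
# sample data as in the question
set.seed(1)
N <- 10000
.time <- sort(sample(1:(100*N), N))
class(.time) <- c("POSIXct", "POSIXt")
df <- data.frame(
  time = .time,
  distance1 = sort(sample(1:(100*N), N)),
  distance2 = sort(sample(1:(100*N), N)),
  letter = sample(letters, N, replace = TRUE)
)

# hypothetical helper: for each time/class pair, return the row index of the
# latest row of that class with searchtime - lookback <= time < searchtime,
# or NA if there is none (same window as time.state in the question)
time.states.fast <- function(df, times, classes, lookback = 10000) {
  tm <- as.numeric(df$time)  # must be sorted non-decreasingly
  # with left.open = TRUE, findInterval returns the number of elements of tm
  # that are strictly smaller than each query value
  lo <- findInterval(times - lookback, tm, left.open = TRUE) + 1L
  hi <- findInterval(times, tm, left.open = TRUE)
  rows <- rep(NA_integer_, length(times))
  for (i in seq_along(times)) {
    if (lo[i] > hi[i]) next                    # no rows inside the window
    idx <- lo[i]:hi[i]                         # only the rows in the window
    hits <- idx[df$letter[idx] == classes[i]]
    if (length(hits) > 0L) rows[i] <- hits[length(hits)]
  }
  rows
}

# latest known state of class "a" before time 500,000
df[time.states.fast(df, 500000, "a"), ]
```

In this sketch the two findInterval calls are made once for all queries, and each loop iteration only touches the rows inside its own window, so the query times do not need to be sorted. Whether this is faster than the batching approach on the real data would have to be measured.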