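Counting the number of events in progress at a given timestamp

Time: 2011-08-26 11:10:23

Tags: r

I have a series of timestamps that mark the start and finish of certain events.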

library(chron)
start <- structure(c(14246.3805439815, 14246.3902662037, 14246.3909606481, 
14246.3992939815, 14246.4013773148, 14246.4034606481, 14246.4062384259, 
14246.4069328704, 14246.4069328704, 14246.4097106481, 14246.4097106481, 
14246.4104050926, 14246.4117939815, 14246.4117939815, 14246.4117939815, 
14246.4145717593, 14246.4152546296, 14246.4152662037, 14246.4152662037, 
14246.4159606481), format = structure(c("m/d/y", "h:m:s"), .Names = c("dates", 
"times")), origin = structure(c(1, 1, 1970), .Names = c("month", 
"day", "year")), class = c("chron", "dates", "times"))

finish <- structure(c(14246.436099537, 14246.4666550926, 14246.4083217593, 
14246.4374884259, 14246.4847106481, 14246.4867939815, 14246.4305439815, 
14246.4659606481, 14246.4520717593, 14246.9097106481, 14246.4930439815, 
14246.4763773148, 14246.4326273148, 14246.4291550926, 14246.4187384259, 
14246.9145717593, 14246.4395601852, 14246.4395717593, 14246.4395717593, 
14246.4367939815), format = structure(c("m/d/y", "h:m:s"), .Names = c("dates", 
"times")), origin = structure(c(1, 1, 1970), .Names = c("month", 
"day", "year")), class = c("chron", "dates", "times"))

events <- data.frame(start, finish)
head(events, 5)

                start              finish
1 (01/02/09 09:07:59) (01/02/09 10:27:59)
2 (01/02/09 09:21:59) (01/02/09 11:11:59)
3 (01/02/09 09:22:59) (01/02/09 09:47:59)
4 (01/02/09 09:34:59) (01/02/09 10:29:59)
5 (01/02/09 09:37:59) (01/02/09 11:37:59)

I now want to count how many events are in progress at specific timestamps.

intervals <- structure(c(14246.3958333333, 14246.40625, 14246.4166666667, 
14246.4270833333, 14246.4375), format = structure(c("m/d/y", 
"h:m:s"), .Names = c("dates", "times")), origin = structure(c(1, 
1, 1970), .Names = c("month", "day", "year")), class = c("chron", 
"dates", "times"))

intervals

[1] (01/02/09 09:30:00) (01/02/09 09:45:00) (01/02/09 10:00:00) (01/02/09 10:15:00) (01/02/09 10:30:00)

So the output I want looks like this:

            intervals count
1 (01/02/09 09:30:00)     3
2 (01/02/09 09:45:00)     7
3 (01/02/09 10:00:00)    19
4 (01/02/09 10:15:00)    18
5 (01/02/09 10:30:00)    12

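Solving this programmatically is simple enough, but I need to do it for 210,000 intervals and more than 1.2 million events. My current approach uses the data.table package and the & operator to check whether each interval lies between the start and finish time of each event.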

library(data.table)
events <- data.table(events)
data.frame(intervals, count = sapply(1:5, function(i) sum(events[, start <= intervals[i] & intervals[i] <= finish])))

But given the size of my data, this approach takes a very long time to run. Any suggestions for a better way to do this in R?

Cheers.

2 answers:

Answer 0: (score: 3)

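The secret to fast code in R is to keep everything in vectors or arrays (arrays are really just vectors in disguise).

Here is a solution that uses only base R arrays. Your data sample is small, so I use replicate and system.time to measure performance.

My solution is about 6 times faster than your solution using sapply and data.table. (My solution takes about 0.7 seconds to solve your small sample dataset 1,000 times.)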

Timing your solution

system.time(replicate(1000, 
    XX <- data.frame(
      intervals, 
      count = sapply(1:5, function(i) sum(events[, start <= intervals[i] & intervals[i] <= finish])))
))

   user  system elapsed 
   4.04    0.05    4.11 

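My solution. First create two helper functions that build matrices of equal size, with one row per event and one column per interval. Then do a simple vectorized comparison, followed by colSums.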

event.array <- function(x, interval){
  len <- length(interval)
  matrix(rep(unclass(x), len), ncol=len)
}

intervals.array <- function(x, intervals){
  len <- length(x)
  matrix(rep(unclass(intervals), len), nrow=len, byrow=TRUE)
} 


a.start <- event.array(start, intervals)
a.finish <- event.array(finish, intervals)
a.intervals <- intervals.array(start, intervals)

data.frame(intervals, 
           count=colSums(a.start <= a.intervals & a.finish >= a.intervals))

            intervals count
1 (01/02/09 09:30:00)     3
2 (01/02/09 09:45:00)     7
3 (01/02/09 10:00:00)    19
4 (01/02/09 10:15:00)    18
5 (01/02/09 10:30:00)    12
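
As a side note, the two helper functions can probably be collapsed into base R's outer(), which builds the same events-by-intervals comparison matrices directly. The following is a minimal sketch on made-up toy numbers; the names ev_start, ev_finish and ts are illustrative plain numerics, not the chron data from the question.

```r
# Toy data (hypothetical, plain numerics rather than chron objects)
ev_start  <- c(1, 2, 4)
ev_finish <- c(3, 6, 5)
ts        <- c(2, 3.5, 4.5, 7)

# outer() returns a length(ev_start) x length(ts) logical matrix:
# row i is event i, column j is timestamp j
started      <- outer(ev_start,  ts, "<=")
not_finished <- outer(ev_finish, ts, ">=")

# An event is in progress when both conditions hold; sum down each column
count <- colSums(started & not_finished)
count
```

With the chron vectors from the question, unclass() them first so that outer() compares plain numbers.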

Timing my solution

system.time(replicate(1000, 
  YY <- data.frame(
          intervals, 
          count=colSums(a.start <= a.intervals & a.finish >= a.intervals))
))

   user  system elapsed 
   0.67    0.02    0.69 

all.equal(XX, YY)
[1] TRUE
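
One caveat for the full problem: the matrix approach needs one cell per event/timestamp pair, so 1.2 million events against 210,000 intervals would mean roughly 2.5 × 10^11 cells per matrix, which will not fit in memory. A possible alternative, offered as a sketch and not part of the timed comparison above, is to sort the start and finish times once and count with findInterval(). The number of events in progress at time t equals the number of starts at or before t minus the number of finishes strictly before t, assuming every event has start <= finish. The names s, f and ts are illustrative; the numbers are the unclassed chron values from the question, and the left.open argument assumes a reasonably recent version of R.

```r
# Unclassed chron values from the question (days since 1970-01-01)
s <- c(14246.3805439815, 14246.3902662037, 14246.3909606481,
       14246.3992939815, 14246.4013773148, 14246.4034606481, 14246.4062384259,
       14246.4069328704, 14246.4069328704, 14246.4097106481, 14246.4097106481,
       14246.4104050926, 14246.4117939815, 14246.4117939815, 14246.4117939815,
       14246.4145717593, 14246.4152546296, 14246.4152662037, 14246.4152662037,
       14246.4159606481)
f <- c(14246.436099537, 14246.4666550926, 14246.4083217593,
       14246.4374884259, 14246.4847106481, 14246.4867939815, 14246.4305439815,
       14246.4659606481, 14246.4520717593, 14246.9097106481, 14246.4930439815,
       14246.4763773148, 14246.4326273148, 14246.4291550926, 14246.4187384259,
       14246.9145717593, 14246.4395601852, 14246.4395717593, 14246.4395717593,
       14246.4367939815)
ts <- c(14246.3958333333, 14246.40625, 14246.4166666667,
        14246.4270833333, 14246.4375)

# Number of events with start <= t, for each t
n_started <- findInterval(ts, sort(s))

# Number of events with finish < t (left.open = TRUE makes the bound strict)
n_finished <- findInterval(ts, sort(f), left.open = TRUE)

# In progress = started but not yet finished (requires start <= finish)
count <- n_started - n_finished

# Cross-check against the direct comparison used in the question
brute <- sapply(ts, function(t) sum(s <= t & t <= f))
stopifnot(all(count == brute))
count
```

Because the work is two sorts plus a binary search per timestamp, memory use stays proportional to the input size instead of the product of the two lengths.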

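Answer 1: (score: 0)

Might it be faster to use dim() instead of sum() and ldply() instead of sapply()?

library(plyr)  # provides ldply()
# Note: strict < and > are used here; the question uses <= at both ends.
b <- function(i, df) { data.frame(i, count = dim(df[with(df, start < i & finish > i), ])[1]) }
ldply(intervals, b, events)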

         i count
1 14246.40     3
2 14246.41     7
3 14246.42    19
4 14246.43    18
5 14246.44    12

I am not familiar with the chron library, so I did not get i to print as a timestamp. Sorry.