I have a series of timestamps marking the start and end of certain events.
library(chron)
start <- structure(c(14246.3805439815, 14246.3902662037, 14246.3909606481,
14246.3992939815, 14246.4013773148, 14246.4034606481, 14246.4062384259,
14246.4069328704, 14246.4069328704, 14246.4097106481, 14246.4097106481,
14246.4104050926, 14246.4117939815, 14246.4117939815, 14246.4117939815,
14246.4145717593, 14246.4152546296, 14246.4152662037, 14246.4152662037,
14246.4159606481), format = structure(c("m/d/y", "h:m:s"), .Names = c("dates",
"times")), origin = structure(c(1, 1, 1970), .Names = c("month",
"day", "year")), class = c("chron", "dates", "times"))
finish <- structure(c(14246.436099537, 14246.4666550926, 14246.4083217593,
14246.4374884259, 14246.4847106481, 14246.4867939815, 14246.4305439815,
14246.4659606481, 14246.4520717593, 14246.9097106481, 14246.4930439815,
14246.4763773148, 14246.4326273148, 14246.4291550926, 14246.4187384259,
14246.9145717593, 14246.4395601852, 14246.4395717593, 14246.4395717593,
14246.4367939815), format = structure(c("m/d/y", "h:m:s"), .Names = c("dates",
"times")), origin = structure(c(1, 1, 1970), .Names = c("month",
"day", "year")), class = c("chron", "dates", "times"))
events <- data.frame(start, finish)
head(events, 5)
start finish
1 (01/02/09 09:07:59) (01/02/09 10:27:59)
2 (01/02/09 09:21:59) (01/02/09 11:11:59)
3 (01/02/09 09:22:59) (01/02/09 09:47:59)
4 (01/02/09 09:34:59) (01/02/09 10:29:59)
5 (01/02/09 09:37:59) (01/02/09 11:37:59)
I would now like to count how many events are in progress at specific timestamps.
intervals <- structure(c(14246.3958333333, 14246.40625, 14246.4166666667,
14246.4270833333, 14246.4375), format = structure(c("m/d/y",
"h:m:s"), .Names = c("dates", "times")), origin = structure(c(1,
1, 1970), .Names = c("month", "day", "year")), class = c("chron",
"dates", "times"))
intervals
[1] (01/02/09 09:30:00) (01/02/09 09:45:00) (01/02/09 10:00:00) (01/02/09 10:15:00) (01/02/09 10:30:00)
So the output I'm after looks like this:
intervals count
1 (01/02/09 09:30:00) 3
2 (01/02/09 09:45:00) 7
3 (01/02/09 10:00:00) 19
4 (01/02/09 10:15:00) 18
5 (01/02/09 10:30:00) 12
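While this is trivial to solve programmatically, I need to do it for 210,000 intervals and more than 1.2 million events. My current approach uses the data.table package and the & operator to check whether each interval lies between an event's start and finish times.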
library(data.table)
events <- data.table(events)
data.frame(intervals, count = sapply(1:5, function(i) sum(events[, start <= intervals[i] & intervals[i] <= finish])))
But given the size of my data, this approach takes a very long time to run. Any suggestions for a better way to do this in R?
Cheers.
Answer 0 (score: 3)
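The secret to fast code in R is to keep everything in vectors or arrays, which are really just vectors in disguise.
Here is a solution that uses only base R arrays. Your data sample is small, so I use replicate and system.time to measure performance.
My solution is about 6 times faster than your solution using sapply and data.table. (My solution takes about 0.7 seconds to solve your small sample data set 1,000 times.)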
Timing your solution
system.time(replicate(1000,
XX <- data.frame(
intervals,
count = sapply(1:5, function(i) sum(events[, start <= intervals[i] & intervals[i] <= finish])))
))
user system elapsed
4.04 0.05 4.11
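My solution: first create two helper functions that build matrices of equal size, in which each row corresponds to an event and each column to an interval. Then do a simple vectorized comparison, followed by colSums: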
event.array <- function(x, interval){
len <- length(interval)
matrix(rep(unclass(x), len), ncol=len)
}
intervals.array <- function(x, intervals){
len <- length(x)
matrix(rep(unclass(intervals), len), nrow=len, byrow=TRUE)
}
a.start <- event.array(start, intervals)
a.finish <- event.array(finish, intervals)
a.intervals <- intervals.array(start, intervals)
data.frame(intervals,
count=colSums(a.start <= a.intervals & a.finish >= a.intervals))
intervals count
1 (01/02/09 09:30:00) 3
2 (01/02/09 09:45:00) 7
3 (01/02/09 10:00:00) 19
4 (01/02/09 10:15:00) 18
5 (01/02/09 10:30:00) 12
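As a side note, the two helper functions above expand the event and interval vectors into equally sized matrices, and base R's outer() can build the same kind of comparison matrix directly. The sketch below is only an illustration of that idea: it uses small made-up numeric vectors (toy_start, toy_finish and toy_times are hypothetical names, not the question's chron data).

```r
# Toy data (hypothetical, not from the question): five events, five query times
toy_start  <- c(1, 2, 2, 5, 7)
toy_finish <- c(4, 3, 9, 6, 8)
toy_times  <- c(0, 2, 4, 6.5, 10)

# outer() returns a length(events) x length(times) logical matrix.
# started[i, j] is TRUE when event i has started at or before time j.
started <- outer(toy_start, toy_times, "<=")
# unfinished[i, j] is TRUE when event i has not finished before time j.
unfinished <- outer(toy_finish, toy_times, ">=")

# An event is in progress when both conditions hold;
# the column sums count the events in progress at each time.
toy_count <- colSums(started & unfinished)
data.frame(time = toy_times, count = toy_count)
```

Like the helper-function version, this builds one matrix cell per event/interval pair, so it is best suited to inputs where that matrix fits in memory.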
Timing my solution
system.time(replicate(1000,
YY <- data.frame(
intervals,
count=colSums(a.start <= a.intervals & a.finish >= a.intervals))
))
user system elapsed
0.67 0.02 0.69
all.equal(XX, YY)
[1] TRUE
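One caveat for the full-size problem: a matrix with one cell per event/interval pair would need about 1.2 million x 210,000 = 2.52e11 cells, and R stores each logical value in 4 bytes, so a single such matrix would take roughly 1 TB. A possible alternative, which is not the method used above, is to sort the start and finish times and count with findInterval(): the number of events in progress at time t equals the number of starts at or before t minus the number of finishes strictly before t. The sketch below assumes every event satisfies start <= finish; count_ongoing and the toy vectors are hypothetical names introduced only for illustration.

```r
# Hypothetical helper (not from the answer above): for every t in `times`,
# count the events with start <= t <= finish without building a matrix.
# Assumes start <= finish for every event.
count_ongoing <- function(start, finish, times) {
  start  <- sort(as.numeric(start))
  finish <- sort(as.numeric(finish))
  times  <- as.numeric(times)
  # Number of events that started at or before each time
  started <- findInterval(times, start)
  # Number of events that finished strictly before each time
  # (the left.open argument requires R >= 3.3.0)
  ended <- findInterval(times, finish, left.open = TRUE)
  started - ended
}

# Toy data (made up for illustration, not the question's data)
ev_start  <- c(1, 2, 2, 5, 7)
ev_finish <- c(4, 3, 9, 6, 8)
q_times   <- c(0, 2, 4, 6.5, 10)

fast  <- count_ongoing(ev_start, ev_finish, q_times)
# Brute-force check using the same comparison as the question
brute <- sapply(q_times, function(t) sum(ev_start <= t & t <= ev_finish))
all(fast == brute)
```

With the question's data, a call along the lines of count_ongoing(events$start, events$finish, intervals) should work, since as.numeric() drops the chron attributes and leaves the underlying day numbers. The cost is dominated by the two sorts plus a binary search per timestamp, rather than one comparison per event/interval pair.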
Answer 1 (score: 0)
Perhaps using dim() instead of sum() and ldply() instead of sapply() would be faster?
library(plyr)  # provides ldply()
# b() counts the events that strictly contain timestamp i
b <- function(i, df) data.frame(i, count = dim(df[with(df, start < i & finish > i), ])[1])
ldply(intervals, b, events)
i count
1 14246.40 3
2 14246.41 7
3 14246.42 19
4 14246.43 18
5 14246.44 12
I'm not familiar with the chron library, so I didn't get i to come out as a timestamp. Sorry.