Two closely related posts are here and here. I have not been able to adapt either of them to my exact situation.
Here is a vector of times:
start.time = as.POSIXct("2013-06-20 01:00:00")
x = start.time + runif(5, min = 0, max = 8*60)
x = x[order(x)]
x
# [1] "2013-06-20 01:00:30 EDT" "2013-06-20 01:00:57 EDT"
# [3] "2013-06-20 01:01:43 EDT" "2013-06-20 01:04:01 EDT"
# [5] "2013-06-20 01:04:10 EDT"
Next, here is a vector of two-minute markers:
y = seq(as.POSIXct("2013-06-20 01:00:00"), as.POSIXct("2013-06-20 01:06:00"), 60*2)
y
# [1] "2013-06-20 01:00:00 EDT" "2013-06-20 01:02:00 EDT"
# [3] "2013-06-20 01:04:00 EDT" "2013-06-20 01:06:00 EDT"
I would like a fast, flexible, scalable way to count the elements of x that fall into the two-minute bin to the right of each element of y, like this:
y count.x
1 2013-06-20 01:00:00 3
2 2013-06-20 01:02:00 0
3 2013-06-20 01:04:00 2
4 2013-06-20 01:06:00 0
Answer 0 (score: 3)
How about:
as.data.frame(table(cut(x, breaks=c(y, Inf))))
Var1 Freq
1 2013-06-20 01:00:00 3
2 2013-06-20 01:02:00 0
3 2013-06-20 01:04:00 2
4 2013-06-20 01:06:00 0
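To make the role of the extra Inf break easier to see, here is a small sketch on fixed timestamps (the values shown in the question's sample output, not a fresh runif draw). The tz = "UTC" argument and the count.x variable are my own assumptions, added only so the example is reproducible.

```r
# Fixed timestamps matching the question's sample output (UTC is an assumption)
x <- as.POSIXct(c("2013-06-20 01:00:30", "2013-06-20 01:00:57",
                  "2013-06-20 01:01:43", "2013-06-20 01:04:01",
                  "2013-06-20 01:04:10"), tz = "UTC")
y <- seq(as.POSIXct("2013-06-20 01:00:00", tz = "UTC"),
         as.POSIXct("2013-06-20 01:06:00", tz = "UTC"), by = 60 * 2)

# Four markers alone define only three bins; appending Inf adds an
# open-ended fourth bin that starts at the last marker (01:06:00)
bins <- cut(x, breaks = c(y, Inf))
count.x <- as.vector(table(bins))

data.frame(y = y, count.x = count.x)
```

Without the Inf, any time at or after the last marker would fall outside the breaks and come back as NA rather than being counted.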
Answer 1 (score: 0)
Here is a function that solves the problem and runs much faster than table(cut(...)):
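get.bin.counts = function(x, name.x = "x", start.pt, end.pt, bin.width){
  # Break points from start.pt up to at most end.pt, bin.width seconds apart
  br.pts = seq(start.pt, end.pt, bin.width)
  # Keep only values covered by the breaks; hist() stops if any x lies outside them
  x = x[(x >= start.pt)&(x <= br.pts[length(br.pts)])]
  # plot = FALSE returns the counts without drawing the histogram
  counts = hist(x, breaks = br.pts, plot = FALSE)$counts
  # Label each bin by its left break point
  dfm = data.frame(br.pts[-length(br.pts)], counts)
  names(dfm) = c(name.x, "freq")
  return(dfm)
}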
The key line here is the one in the middle, counts = hist(...). Setting the plot option of the hist function to FALSE is essential.
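As a minimal sketch of what that line does on its own, here is hist with plot = FALSE applied to the small example from the question. The fixed timestamps and tz = "UTC" are assumptions made for reproducibility. They are not part of the answer.

```r
# Fixed timestamps matching the question's sample output (UTC is an assumption)
x <- as.POSIXct(c("2013-06-20 01:00:30", "2013-06-20 01:00:57",
                  "2013-06-20 01:01:43", "2013-06-20 01:04:01",
                  "2013-06-20 01:04:10"), tz = "UTC")
br.pts <- seq(as.POSIXct("2013-06-20 01:00:00", tz = "UTC"),
              as.POSIXct("2013-06-20 01:06:00", tz = "UTC"), by = 60 * 2)

# plot = FALSE returns the histogram object without drawing anything
h <- hist(x, breaks = br.pts, plot = FALSE)

# One count per bin, i.e. one fewer than the number of break points
h$counts
```

Two differences from the cut approach are worth keeping in mind. Four break points give three counts, so there is no row for the last marker. Also, hist closes its bins on the right by default, whereas cut on date-times closes them on the left, so a time that falls exactly on a break point can land in a different bin.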
To test the speed of this function, I ran it as follows:
# First define x, a large vector of times:
start.time = as.POSIXct("2012-11-01 00:00:00")
x = start.time + runif(50000, min = 0, max = 365*24*3600)
x = x[order(x)]
# Apply the function, keeping track of running time:
t1 = Sys.time()
dfm = get.bin.counts(x, name.x = "time",
start.pt = as.POSIXct("2012-11-01 00:00:00"),
end.pt = as.POSIXct("2013-07-01 00:00:00"),
bin.width = 60)
as.numeric(Sys.time()-t1) #prints elapsed time
With this example, my function runs about 10 times faster than table(cut(...)). Credit goes to the cut help page, which states: "Instead of table(cut(x, br)), hist(x, br, plot = FALSE) is more efficient and less memory hungry."
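For comparison, here is a sketch of a third approach that is not from either answer: findInterval plus tabulate. As far as I recall, the same note on the cut help page also points to findInterval as the more efficient choice when only bin indices are needed. The function name count.bins and the fixed UTC timestamps are hypothetical, chosen for illustration, and I have not benchmarked this against the functions above.

```r
# Hypothetical helper: count the elements of x in the bin starting at each
# element of y. y must be sorted in increasing order.
count.bins <- function(x, y) {
  # Index of the last marker that is <= each x (0 if x precedes y[1])
  idx <- findInterval(as.numeric(x), as.numeric(y))
  # tabulate ignores indices outside 1..nbins, so early values are dropped;
  # values at or after the last marker are counted in the last bin
  data.frame(y = y, count.x = tabulate(idx, nbins = length(y)))
}

# Usage on the question's sample values (UTC is an assumption)
x <- as.POSIXct(c("2013-06-20 01:00:30", "2013-06-20 01:00:57",
                  "2013-06-20 01:01:43", "2013-06-20 01:04:01",
                  "2013-06-20 01:04:10"), tz = "UTC")
y <- seq(as.POSIXct("2013-06-20 01:00:00", tz = "UTC"),
         as.POSIXct("2013-06-20 01:06:00", tz = "UTC"), by = 60 * 2)
res <- count.bins(x, y)
res
```

Like cut(x, breaks = c(y, Inf)), this treats the last bin as open-ended and closes bins on the left, so it returns one row per element of y.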