Question

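I'd like to run a discrete-time simulation (simplified version below). I generate a data frame of population members (one member per row) with timestamps for entering and exiting a website. I then want to count, at each time interval, how many members are on the site.

Currently I loop over time and, at each second, count the members who have entered but not yet exited. (I also tried a destructive iteration that removes exited members at each time interval, which took even longer. I also understand that I could use larger time intervals in the loop.)

How can I use linear algebra to eliminate the for loop and the excess run time? My current approach does not scale well as the population grows, and of course it is linear in the duration.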

popSize <- 10000
simDuration <- 10000
enterTimestamp <- rexp(n = popSize, rate = .001)
exitTimestamp <- enterTimestamp + rexp(n = popSize, rate = .001)
popEvents <- data.frame(enterTimestamp, exitTimestamp)
visitorLoad <- integer(length = simDuration)
for (i in seq_len(simDuration)) {
  # Count members who have entered by second i but have not yet exited
  visitorLoad[i] <- sum(popEvents$enterTimestamp <= i &
                        popEvents$exitTimestamp > i)
  if (i %% 100 == 0) {print(paste('Sim at',i,'out of',simDuration,
                      'seconds.',sep=' ') )}
}
plot(visitorLoad, type = 'l', ylab = 'Visitor Load', xlab='Time Elapsed (sec)')

Answer 1

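You can count the visitors who enter and exit at each time step, then use a cumulative sum to get the number of visitors present at any given time. This seems to meet your requirement for code that runs quickly, although it doesn't use linear algebra.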

diffs = rep(0, simDuration+1)

# Store the number of times a visitor enters and exits at each timestep. The table
# will contain headers that are the timesteps and values that are the number of
# people entering or exiting at the timestep.
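# Entries after simDuration are clamped to the spare slot simDuration+1, like exits,
# so diffs never grows beyond its allocated length.
tabEnter = table(pmin(simDuration+1, pmax(1, ceiling(enterTimestamp))))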
tabExit = table(pmin(simDuration+1, ceiling(exitTimestamp)))

# For each time index, add the number of people entering and subtract the number of
# people exiting. For instance, if in period 20, 3 people entered and 4 exited, then
# diffs[20] equals -1. as.numeric(names(tabEnter)) is the periods for which at least
# one person entered, and tabEnter is the number of people in each of those periods.
diffs[as.numeric(names(tabEnter))] = diffs[as.numeric(names(tabEnter))] + tabEnter
diffs[as.numeric(names(tabExit))] = diffs[as.numeric(names(tabExit))] - tabExit

# cumsum() sums the diffs vector through a particular time point. 
visitorLoad2 = head(cumsum(diffs), simDuration)
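
To see why the cumulative sum reproduces the loop, here is a small sanity check on toy data. The three visitors and the names `enter`, `exit`, `duration`, `load_cumsum` and `load_loop` are made up for illustration, and `tabulate()` is swapped in for `table()` only to keep the sketch short; it is not the code above, just the same idea on a scale you can verify by hand.

```r
# Toy data (hypothetical values, chosen only for illustration)
enter <- c(0.5, 1.2, 1.7)
exit  <- c(2.5, 3.1, 1.9)
duration <- 4

# Net change in head count at each whole second.
# Slot duration + 1 absorbs anything that happens after the simulation ends.
diffs <- tabulate(pmin(duration + 1, ceiling(enter)), nbins = duration + 1) -
         tabulate(pmin(duration + 1, ceiling(exit)),  nbins = duration + 1)

# Running total of the net changes = visitors present at each second
load_cumsum <- head(cumsum(diffs), duration)

# Reference: the direct per-second comparison used in the question
load_loop <- vapply(seq_len(duration),
                    function(i) sum(enter <= i & exit > i),
                    integer(1))

stopifnot(identical(load_cumsum, load_loop))
print(load_cumsum)  # 1 2 1 0
```

A visitor is counted from second ceiling(enter) onward and stops being counted at second ceiling(exit), which is exactly the `enter <= i & exit > i` condition for integer i.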

Answer 2

For simplicity, how about this:

vl <- unlist(lapply(1:simDuration, function(i) sum((enterTimestamp <= i) * (exitTimestamp > i))))
plot(vl, type = 'l', ylab = 'Visitor Load', xlab='Time Elapsed (sec)')

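It is about twice as fast as your current loop, but if performance matters more, then @josilber's solution is better. There may also be something that can be done with data.table(); I'll have a think...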
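
As an aside, the closest thing to the "matrix" formulation asked about in the question is probably `outer()`. The sketch below is an assumption-laden illustration rather than a recommendation: the sizes are deliberately small, the seed is arbitrary, and the names `n`, `duration`, `present`, `vl_outer` and `vl_lapply` are hypothetical. It builds a members-by-seconds logical matrix, so memory grows with population size times duration, which is why it is unlikely to scale to the full problem.

```r
set.seed(42)                      # arbitrary seed, only for reproducibility
n <- 200                          # small toy population
duration <- 300                   # small toy duration
enter <- rexp(n, rate = 0.01)
exit  <- enter + rexp(n, rate = 0.01)
times <- seq_len(duration)

# present[p, t] is TRUE when member p has entered by second t and not yet exited
present <- outer(enter, times, "<=") & outer(exit, times, ">")

# Column sums give the number of members on site at each second
vl_outer <- colSums(present)

# Same quantity computed with the lapply() one-liner, for comparison
vl_lapply <- unlist(lapply(times, function(i) sum((enter <= i) * (exit > i))))

stopifnot(all(vl_outer == vl_lapply))
```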

Edit - how about this for speed:

library(data.table)
library(plyr) # for count() function

system.time({

enter<-data.table(count(ceiling(enterTimestamp))) # entries grouped by second
exit<-data.table(count(ceiling(exitTimestamp)))   # exits grouped by second
sim<-data.table(x=1:simDuration)                  # time index
merged<-merge(merge(sim,enter,by="x",all.x=TRUE),exit,by="x",all.x=TRUE)
mat<-data.matrix(merged[,list(freq.x,freq.y)])    # make matrix to remove NAs
mat[is.na(mat)]<-0                                # remove NAs, there are quicker ways but more complicated
vl<-cumsum(mat[,1]-mat[,2])                       # cumsum() to roll up the movements

})

user  system elapsed 
0.02    0.00    0.02 

plot(vl, type = 'l', ylab = 'Visitor Load', xlab='Time Elapsed (sec)')

**Further edit** - a balance of performance and simplicity

system.time(cumsum(data.frame(table(cut(enterTimestamp,0:simDuration))-table(cut(exitTimestamp,0:simDuration)))[,2]))
user  system elapsed 
0.09    0.00    0.10
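
Another option in the same spirit, not taken from either answer, is base R's `findInterval()`: on a sorted vector it returns how many elements are less than or equal to each query value. Because every exit happens after the matching entry, "entered so far" minus "exited so far" equals the number currently on site. The sketch below uses smaller sizes and an arbitrary seed so the check against the direct comparison runs quickly; `times`, `entered`, `exited`, `visitorLoad3` and `reference` are hypothetical names.

```r
set.seed(1)                       # arbitrary seed, only for reproducibility
popSize <- 1000
simDuration <- 2000
enterTimestamp <- rexp(n = popSize, rate = .001)
exitTimestamp  <- enterTimestamp + rexp(n = popSize, rate = .001)

times <- seq_len(simDuration)

# Number of members who have entered / exited by each second
entered <- findInterval(times, sort(enterTimestamp))
exited  <- findInterval(times, sort(exitTimestamp))

# Members on site = entered so far minus exited so far
visitorLoad3 <- entered - exited

# Reference: the per-second comparison from the question
reference <- vapply(times,
                    function(i) sum(enterTimestamp <= i & exitTimestamp > i),
                    integer(1))

stopifnot(all(visitorLoad3 == reference))
```

One possible advantage of this form is that `times` does not have to be whole seconds; any increasing grid of query times should work the same way.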

How to vectorize comparison instead of for-loop in R?