Question

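I am iterating along a POSIX sequence to determine the number of concurrent events at a given time, using exactly the method described in this question and its corresponding answer: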

How to count the number of concurrent users using time interval data?

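My problem is that my tinterval sequence, in minutes, covers a whole year, which means it has 523,025 entries. I am also thinking about a resolution in seconds, which would make things even worse.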

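Is there anything I can do to improve this code (for example, is the order of the date intervals in the input data (tdata) relevant?), or do I have to accept the performance if I want a solution in R?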

Answer 1

You could try data.table's new foverlaps function. Using the data from the other question:

library(data.table)
setDT(tdata)
setkey(tdata, start, end)
minutes <- data.table(start = seq(trunc(min(tdata[["start"]]), "mins"), 
                                  round(max(tdata[["end"]]), "mins"), by="min"))
minutes[, end := start+59]
setkey(minutes, start, end)
DT <- foverlaps(tdata, minutes, type="any")
counts <- DT[, .N, by=start]
plot(N~start, data=counts, type="s")

resulting plot

I haven't timed this on large data. Give it a try.
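Since the data from the other question is not reproduced here, the following is a minimal, self-contained sketch of the same foverlaps approach on made-up data. The three sessions and the date are hypothetical, chosen only so the result is easy to check by hand.

```r
library(data.table)

# Hypothetical toy data: three sessions, all times in UTC.
tdata <- data.table(
  start = as.POSIXct(c("2024-01-01 00:00:10", "2024-01-01 00:00:40",
                       "2024-01-01 00:02:05"), tz = "UTC"),
  end   = as.POSIXct(c("2024-01-01 00:01:20", "2024-01-01 00:00:50",
                       "2024-01-01 00:02:30"), tz = "UTC")
)
setkey(tdata, start, end)

# One row per minute bin, each covering start .. start + 59 seconds.
minutes <- data.table(
  start = seq(as.POSIXct("2024-01-01 00:00:00", tz = "UTC"),
              as.POSIXct("2024-01-01 00:02:00", tz = "UTC"), by = "min")
)
minutes[, end := start + 59]
setkey(minutes, start, end)

# Match every session to every minute bin it overlaps, then count per bin.
DT <- foverlaps(tdata, minutes, type = "any")
counts <- DT[, .N, by = start]
print(counts)
```

Note that this counts sessions active at any point during each minute, which is not the same as the number active at one instant, and that minutes with no sessions at all do not appear in counts.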

Answer 2

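Here is another approach that should be faster than processing lists. It relies on data.table joins and on lubridate to bin times to the nearest minute. It also assumes that there were 0 users before recording started, but this can be fixed by adding a constant to concurrent at the end: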

library(data.table)
library(lubridate)

td <- data.table(start=floor_date(tdata$start, "minute"),
                 end=ceiling_date(tdata$end, "minute"))

# create vector of all minutes from start to end
# about 530K for a whole year
time.grid <- seq(from=min(td$start), to=max(td$end), by="min")
users <- data.table(time=time.grid, key="time")

# match users on starting time and
# sum matches by start time to count multiple logins in the same minute
setkey(td, start)
users <- td[users, 
          list(started=!is.na(end)), 
          nomatch=NA, 
          allow.cartesian=TRUE][, list(started=sum(started)), 
                                by=start]

# match users on ending time, essentially the same procedure
setkey(td, end)
users <- td[users, 
            list(started, ended=!is.na(start)), 
            nomatch=NA, 
            allow.cartesian=TRUE][, list(started=sum(started), 
                                         ended=sum(ended)), 
                                  by=end]

# fix timestamp column name
setnames(users, "end", "time")

# here you can exclude all entries where both counts are zero
# for a sparse representation
users <- users[started > 0 | ended > 0]

# last step, take difference of cumulative sums to get concurrent users
users[, concurrent := cumsum(started) - cumsum(ended)]

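Each of the two complex joins can be split into two steps (first join, then aggregate by minute), but I remember reading that this way is more efficient. If it is not, splitting them would make the operations easier to read.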
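The core idea above (count starts and ends per minute, then take the difference of the cumulative sums) can be sketched in base R without the joins. This is an illustration on hypothetical toy data, not the answer's code: it swaps the data.table joins for tabulate() and uses floor()/ceiling() on epoch seconds as stand-ins for floor_date and ceiling_date, assuming UTC timestamps.

```r
# Hypothetical toy data: three sessions, all times in UTC.
start <- as.POSIXct(c("2024-01-01 00:00:10", "2024-01-01 00:00:40",
                      "2024-01-01 00:02:05"), tz = "UTC")
end   <- as.POSIXct(c("2024-01-01 00:01:20", "2024-01-01 00:00:50",
                      "2024-01-01 00:02:30"), tz = "UTC")

# Minute index since the epoch: starts rounded down, ends rounded up.
start_bin <- floor(as.numeric(start) / 60)
end_bin   <- ceiling(as.numeric(end) / 60)

# Size of the minute grid spanning all sessions.
first <- min(start_bin)
n     <- max(end_bin) - first + 1

# Number of sessions starting and ending in each minute of the grid.
started <- tabulate(start_bin - first + 1, nbins = n)
ended   <- tabulate(end_bin - first + 1, nbins = n)

users <- data.frame(
  time = as.POSIXct((first + seq_len(n) - 1) * 60,
                    origin = "1970-01-01", tz = "UTC"),
  started = started,
  ended = ended,
  # Difference of cumulative sums gives the concurrent count.
  concurrent = cumsum(started) - cumsum(ended)
)
print(users)
```

Because a session is counted from its floored start minute up to (but not including) its ceilinged end minute, the count drops back to 0 in the last row once every session has ended.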

Answer 3

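R is an interpreted language, which means that every time you ask it to execute a command, it first has to interpret your code and then execute it. For loops, this means that on each iteration of the for it has to "recompile" your code, which is of course very slow. I know of three common ways that help with this.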

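R is vector-oriented, so loops are most likely not a good approach. If possible, try to rethink your logic here so that the approach is vectorized.
Use a just-in-time compiler.
(This is what I usually end up doing.) Use Rcpp to translate the loopy code into C/C++. This can give you a speed-up of around a thousand times.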
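The first two options can be sketched in base R as follows. The data and the function names (count_loop, count_loop_cmp, count_vec) are hypothetical, and times are plain numeric seconds to keep the example short. The vectorized version assumes every interval has start <= end and treats intervals as half-open, [start, end). The Rcpp route is not shown here.

```r
library(compiler)  # ships with R; provides cmpfun()

# Hypothetical toy data: interval bounds and query times, in seconds.
start <- c(10, 40, 125)
end   <- c(80, 50, 150)
grid  <- seq(0, 180, by = 60)  # query times: 0, 60, 120, 180

# Loop version: for each query time, count the intervals containing it.
count_loop <- function(grid, start, end) {
  out <- integer(length(grid))
  for (i in seq_along(grid)) {
    out[i] <- sum(start <= grid[i] & grid[i] < end)
  }
  out
}

# Byte-compiled copy of the same loop.
count_loop_cmp <- cmpfun(count_loop)

# Vectorized version: intervals started by time t minus intervals ended by t.
# findInterval(t, v) returns how many elements of the sorted vector v are <= t.
count_vec <- function(grid, start, end) {
  findInterval(grid, sort(start)) - findInterval(grid, sort(end))
}

res_loop <- count_loop(grid, start, end)
res_cmp  <- count_loop_cmp(grid, start, end)
res_vec  <- count_vec(grid, start, end)
print(res_vec)
```

All three functions return the same counts. Since R 3.4.0 the byte-code JIT is enabled by default, so an explicit cmpfun() call usually makes little difference on current versions. The vectorized version avoids the per-iteration scan over all intervals, which is where most of the time goes on a year of minutes.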

Improving the speed of iterating along a POSIX sequence
