Question

在我的一个应用程序中，有一段代码可以根据另一个对象的值从data.table对象中检索信息。

# say this table contains customers details
dt <- data.table(id=LETTERS[1:4],
                 start=seq(as.Date("2010-01-01"), as.Date("2010-04-01"), "month"),
                 end=seq(as.Date("2010-01-01"), as.Date("2010-04-01"), "month") + c(6,8,10,5),
                 key="id")

# this one has some historical details
dt1 <- data.table(id=rep(LETTERS[1:4], each=120),
                  date=seq(as.Date("2010-01-01"), as.Date("2010-04-30"), "day"),
                  var=rnorm(120),
                  key="id,date")

# and here I finally retrieve my historical information based one customer detail
#
library(data.table)

myfunc <- function(x) {
  # some code
  period <- seq(x$start, x$end, "day")
  dt1[.(x$id, period)][, mean(var)]
  # some code
}

获取我使用adply

的所有结果

library(plyr)
library(microbenchmark)
> adply(dt, 1, myfunc)
   id      start        end         V1
1:  A 2010-01-01 2010-01-07  0.3143536
2:  B 2010-02-01 2010-02-09 -0.5796084
3:  C 2010-03-01 2010-03-11  0.1171404
4:  D 2010-04-01 2010-04-06  0.2384237

> microbenchmark(adply(dt, 1, myfunc))
Unit: milliseconds
                 expr      min       lq   median       uq      max neval
 adply(dt, 1, myfunc) 8.812486 8.998338 9.105776 9.223637 88.14057   100

您是否知道避免adply调用的方法，并在一个data.table语句中执行上述操作？或者无论如何更快的方法？（标题编辑建议超过欢迎，我想不出更好，谢谢）

Answer 1

这是使用roll的{{1}}参数的好地方：

data.table

随着数据集大小的增长，时差将变得更加显着。

Answer 2

我可以给你一堆嵌套的[.data.table电话：

set.seed(1)
require(data.table)
# generate dt, dt1 as above
dt[
    dt1[
        as.list(dt[,seq.Date(start,end,"day"),by="id"])
    ][,mean(var),by=id]
]

#    id      start        end          V1
# 1:  A 2010-01-01 2010-01-07  0.04475859
# 2:  B 2010-02-01 2010-02-09 -0.01681972
# 3:  C 2010-03-01 2010-03-11  0.39791318
# 4:  D 2010-04-01 2010-04-06  0.77854732

我正在使用as.list取消设置密钥。我想知道是否有比这更好的方法......

require(microbenchmark)
require(plyr)
microbenchmark(
    adply=adply(dt, 1, myfunc),
    dtdtdt= dt[dt1[as.list(dt[,seq.Date(start,end,"day"),by="id"])][,mean(var),by=id]]
)

# Unit: milliseconds
#    expr       min        lq    median        uq       max neval
#   adply 12.987334 13.247374 13.477386 14.371258 18.362505   100
#  dtdtdt  4.854708  4.944596  4.993678  5.233507  7.082461   100

编辑：（eddi）上述替代方案需要少量合并（如评论中所述）：

setkey(dt, NULL)

dt1[dt[, list(seq.Date(start,end,"day"), end), by=id]][,
    list(start = date[1], end = end[1], result = mean(var)), by = id]
# or
dt1[dt[, seq.Date(start,end,"day"), by=id]][,
    list(start = date[1], end = date[.N], result = mean(var)), by = id]

将范围的端点与序列合并

2 个答案: