In one of my applications, there is a piece of code that retrieves information from a data.table object based on values in another object.
library(data.table)

# say this table contains customer details
dt <- data.table(id = LETTERS[1:4],
                 start = seq(as.Date("2010-01-01"), as.Date("2010-04-01"), "month"),
                 end = seq(as.Date("2010-01-01"), as.Date("2010-04-01"), "month") + c(6, 8, 10, 5),
                 key = "id")
# this one has some historical details
dt1 <- data.table(id = rep(LETTERS[1:4], each = 120),
                  date = seq(as.Date("2010-01-01"), as.Date("2010-04-30"), "day"),
                  var = rnorm(120),
                  key = "id,date")
# and here I finally retrieve my historical information based on the customer details
myfunc <- function(x) {
  # some code
  period <- seq(x$start, x$end, "day")
  dt1[.(x$id, period)][, mean(var)]
  # some code
}
To get the result, I use adply:
library(plyr)
library(microbenchmark)
> adply(dt, 1, myfunc)
id start end V1
1: A 2010-01-01 2010-01-07 0.3143536
2: B 2010-02-01 2010-02-09 -0.5796084
3: C 2010-03-01 2010-03-11 0.1171404
4: D 2010-04-01 2010-04-06 0.2384237
> microbenchmark(adply(dt, 1, myfunc))
Unit: milliseconds
expr min lq median uq max neval
adply(dt, 1, myfunc) 8.812486 8.998338 9.105776 9.223637 88.14057 100
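Do you know of a way to avoid the adply call and do the above in a single data.table statement? Or, failing that, any faster approach? (Suggestions for a better title are more than welcome; I could not think of one. Thanks.)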
Answer 0 (score: 5)
This is a good place to use the roll argument of data.table.
The time difference will become more significant as the dataset grows.
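A minimal sketch of what a rolling join could look like for this problem, not necessarily the exact code this answer had in mind. It assumes a data.table version that accepts on= together with roll= and nomatch = NULL, and the names joined and res are only illustrative.

```r
library(data.table)

set.seed(1)
dt <- data.table(id = LETTERS[1:4],
                 start = seq(as.Date("2010-01-01"), as.Date("2010-04-01"), "month"),
                 end = seq(as.Date("2010-01-01"), as.Date("2010-04-01"), "month") + c(6, 8, 10, 5))
dt1 <- data.table(id = rep(LETTERS[1:4], each = 120),
                  date = seq(as.Date("2010-01-01"), as.Date("2010-04-30"), "day"),
                  var = rnorm(120))

# For every row of dt1, roll back to the customer's latest start on or before date.
# Dates before start have nothing to roll from and are dropped by nomatch = NULL.
# Note: in the result, the column named start holds dt1's date values.
joined <- dt[dt1, on = .(id, start = date), roll = TRUE, nomatch = NULL]

# Keep only the dates up to each customer's end, then average per customer.
res <- joined[start <= end, .(result = mean(var)), by = id]
res
```

The join touches every row of dt1 once instead of calling a function per customer, which is where the speedup would be expected to come from on larger data.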
Answer 1 (score: 2)
I can offer you a bunch of nested [.data.table calls:
set.seed(1)
require(data.table)
# generate dt, dt1 as above
dt[
dt1[
as.list(dt[,seq.Date(start,end,"day"),by="id"])
][,mean(var),by=id]
]
# id start end V1
# 1: A 2010-01-01 2010-01-07 0.04475859
# 2: B 2010-02-01 2010-02-09 -0.01681972
# 3: C 2010-03-01 2010-03-11 0.39791318
# 4: D 2010-04-01 2010-04-06 0.77854732
I'm using as.list to unset the key. I wonder whether there is a better way to do this...
require(microbenchmark)
require(plyr)
microbenchmark(
adply=adply(dt, 1, myfunc),
dtdtdt= dt[dt1[as.list(dt[,seq.Date(start,end,"day"),by="id"])][,mean(var),by=id]]
)
# Unit: milliseconds
# expr min lq median uq max neval
# adply 12.987334 13.247374 13.477386 14.371258 18.362505 100
# dtdtdt 4.854708 4.944596 4.993678 5.233507 7.082461 100
Edit (eddi): an alternative to the above that requires fewer merges (as discussed in the comments):
setkey(dt, NULL)
dt1[dt[, list(seq.Date(start,end,"day"), end), by=id]][,
list(start = date[1], end = end[1], result = mean(var)), by = id]
# or
dt1[dt[, seq.Date(start,end,"day"), by=id]][,
list(start = date[1], end = date[.N], result = mean(var)), by = id]
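Another option that may be worth trying on more recent data.table versions is a non-equi join, which can match each customer's date range directly without expanding it into one row per day. This is a sketch under that assumption, not something taken from the answers above; the name res is illustrative.

```r
library(data.table)

set.seed(1)
dt <- data.table(id = LETTERS[1:4],
                 start = seq(as.Date("2010-01-01"), as.Date("2010-04-01"), "month"),
                 end = seq(as.Date("2010-01-01"), as.Date("2010-04-01"), "month") + c(6, 8, 10, 5))
dt1 <- data.table(id = rep(LETTERS[1:4], each = 120),
                  date = seq(as.Date("2010-01-01"), as.Date("2010-04-30"), "day"),
                  var = rnorm(120))

# For each row of dt, select the dt1 rows with the same id whose date lies in
# [start, end], and average var within that group (by = .EACHI).
# Note: the result shows two columns named date; they hold start and end.
res <- dt1[dt, on = .(id, date >= start, date <= end),
           .(result = mean(var)), by = .EACHI]
res
```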