I am running the following code with data.table, and I would like to better understand which conditions trigger GForce.
DT = data.table(date = rep(seq(Sys.Date(), by = "-1 day", length.out = 1000), 10),
x = runif(10000),
id = rep(1:10, each = 1000))
For the following case, I can see that it works:
DT[, .(max(x), min(x), mean(x)), by = id, verbose = T]
Detected that j uses these columns: x
Finding groups using forderv ... 0 sec
Finding group sizes from the positions (can be avoided to save RAM) ... 0 sec
lapply optimization is on, j unchanged as 'list(max(x), min(x), mean(x))'
GForce optimized j to 'list(gmax(x), gmin(x), gmean(x))'
Making each group and running j (GForce TRUE) ... 0 secs
But for my use case, it does not:
window1 <- Sys.Date() - 50
window2 <- Sys.Date() - 150
window3 <- Sys.Date() - 550
DT[, .(max(x[date > Sys.Date() - 50]), max(x[date > Sys.Date() - 150]),
max(x[date > Sys.Date() - 550])), by = id, verbose = T]
Detected that j uses these columns: x,date
Finding groups using forderv ... 0 sec
Finding group sizes from the positions (can be avoided to save RAM) ... 0 sec
lapply optimization is on, j unchanged as 'list(max(x[date > Sys.Date() - 50]), max(x[date > Sys.Date() - 150]), max(x[date > Sys.Date() - 550]))'
GForce is on, left j unchanged
Old mean optimization is on, left j unchanged.
Making each group and running j (GForce FALSE) ...
memcpy contiguous groups took 0.000s for 10 groups
eval(j) took 0.005s for 10 calls
0.005 secs
The only explanation I can think of is that each vector passed to max has a different length.
Answer 0 (score: 2)
I would use a non-equi join:
# convert to IDate for speed
DT[, date := as.IDate(date)]
mDT = CJ(id = unique(DT$id), days_ago = c(50L, 150L, 550L))
mDT[, date_dn := as.IDate(Sys.Date()) - days_ago]
res = DT[mDT, on=.(id, date > date_dn), .(
days_ago = first(days_ago),
m = mean(x)
), by=.EACHI, verbose=TRUE]
which prints...
Non-equi join operators detected ...
forder took ... 0 secs
Generating group lengths ... done in 0 secs
Generating non-equi group ids ... done in 0.01 secs
Found 1 non-equi group(s) ...
Starting bmerge ...done in 0 secs
Detected that j uses these columns: days_ago,x
lapply optimization is on, j unchanged as 'list(first(days_ago), mean(x))'
Old mean optimization changed j from 'list(first(days_ago), mean(x))' to 'list(first(days_ago), .External(Cfastmean, x, FALSE))'
Making each group and running j (GForce FALSE) ...
collecting discontiguous groups took 0.000s for 30 groups
eval(j) took 0.000s for 30 calls
0 secs
So, for some reason, this uses a different form of optimization rather than GForce.
The result looks like...
id date days_ago m
1: 1 2017-12-19 50 0.4435722
2: 1 2017-09-10 150 0.4842963
3: 1 2016-08-06 550 0.4775890
4: 2 2017-12-19 50 0.4838715
5: 2 2017-09-10 150 0.5150688
6: 2 2016-08-06 550 0.5141174
7: 3 2017-12-19 50 0.4804182
8: 3 2017-09-10 150 0.4910027
9: 3 2016-08-06 550 0.4901343
10: 4 2017-12-19 50 0.4644922
11: 4 2017-09-10 150 0.4902132
12: 4 2016-08-06 550 0.4810129
13: 5 2017-12-19 50 0.4666715
14: 5 2017-09-10 150 0.5193629
15: 5 2016-08-06 550 0.4850173
16: 6 2017-12-19 50 0.5318109
17: 6 2017-09-10 150 0.5481641
18: 6 2016-08-06 550 0.5216787
19: 7 2017-12-19 50 0.4500243
20: 7 2017-09-10 150 0.4915983
21: 7 2016-08-06 550 0.5055563
22: 8 2017-12-19 50 0.4958809
23: 8 2017-09-10 150 0.4915432
24: 8 2016-08-06 550 0.4981277
25: 9 2017-12-19 50 0.5833083
26: 9 2017-09-10 150 0.5160464
27: 9 2016-08-06 550 0.5091702
28: 10 2017-12-19 50 0.4946466
29: 10 2017-09-10 150 0.4798743
30: 10 2016-08-06 550 0.5030687
id date days_ago m
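As far as I know, this kind of optimization only kicks in when the argument to the function (mean here) is a plain column like x, rather than an expression such as the date-filtered subset of x used in the question.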
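A minimal sketch of how one might check this, assuming a small made-up table (`toy` and its columns are hypothetical, not from the question): run the same grouped aggregation with a plain column, with a subset inside `j`, and with the filter moved to `i`, then look for the `GForce optimized j` line in each verbose log. The exact log wording may differ between data.table versions.

```r
library(data.table)

# Hypothetical toy data, only for illustration
set.seed(1)
toy <- data.table(id  = rep(1:3, each = 4),
                  day = rep(1:4, times = 3),
                  x   = runif(12))

# Plain column inside max(): the verbose log should report a GForce rewrite
plain <- toy[, .(m = max(x)), by = id, verbose = TRUE]

# Subset inside max(): j is now an expression, so expect "GForce FALSE"
in_j <- toy[, .(m = max(x[day > 2])), by = id, verbose = TRUE]

# Same filter moved to i: j is a plain max(x) again
in_i <- toy[day > 2, .(m = max(x)), by = id, verbose = TRUE]

# The two filtered variants should agree on the per-group maxima
all.equal(in_j, in_i)
```

Moving the filter to `i` only works directly when every aggregate shares the same filter; with several windows it takes one query per window.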
Answer 1 (score: 0)
I ran the solution suggested by @Frank and got the following:
DT[, date := as.IDate(date)]
mDT = CJ(id = unique(DT$id), days_ago = c(50L, 150L, 550L))
mDT[, date_dn := as.IDate(Sys.Date()) - days_ago]
cDT <- copy(DT) # To make sure we run different methods on different datasets
window1 <- Sys.Date() - 50
window2 <- Sys.Date() - 150
window3 <- Sys.Date() - 550
library(microbenchmark)
microbenchmark(
cDT[mDT, on=.(id, date > date_dn), .(days_ago = first(days_ago), m = mean(x)), by=.EACHI],
DT[, .(mean(x[date > window1]), mean(x[date > window2]), mean(x[date > window3])), by = id]
)
Unit: microseconds
expr
cDT[mDT, on = .(id, date > date_dn), .(days_ago = first(days_ago), m = mean(x)), by = .EACHI]
DT[, .(mean(x[date > window1]), mean(x[date > window2]), mean(x[date > window3])), by = id]
min lq mean median uq max neval cld
822.451 1462.756 1708.083 2481.601 2875.785 4459.506 100 b
1948.851 2313.842 2626.432 1565.562 1710.693 8717.868 100 a
I would not be surprised if the join turned out to be the more expensive of the two.
Answer 2 (score: 0)
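Before comparing timings, it can help to confirm that both approaches compute the same numbers. The following is a sketch under assumptions: `toy`, `lookup`, and the fixed reference date `today` are made up for illustration, and the join result is reshaped with `dcast` so it can be compared column by column with the grouped version.

```r
library(data.table)

set.seed(42)
today <- as.IDate("2018-01-01")  # fixed reference date (assumption, for reproducibility)
toy <- data.table(id   = rep(1:3, each = 600),
                  date = today - rep(0:599, times = 3),
                  x    = runif(1800))

days <- c(50L, 150L, 550L)

# Approach A: non-equi join, one output row per (id, window)
lookup <- CJ(id = unique(toy$id), days_ago = days)
lookup[, date_dn := today - days_ago]
joined <- toy[lookup, on = .(id, date > date_dn),
              .(days_ago = first(days_ago), m = mean(x)), by = .EACHI]

# Approach B: subset inside j, one output column per window
grouped <- toy[, .(w50  = mean(x[date > today - 50L]),
                   w150 = mean(x[date > today - 150L]),
                   w550 = mean(x[date > today - 550L])), by = id]

# Reshape A to wide (columns ordered by days_ago) and compare the values
wide <- dcast(joined, id ~ days_ago, value.var = "m")
same <- isTRUE(all.equal(unname(as.matrix(wide[, !"id"])),
                         unname(as.matrix(grouped[, !"id"]))))
same
```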
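I was looking for a way to force GForce on and came across this question. mtd3 below contains one way to get GForce switched on for the OP's particular problem, but it is still not faster than the OP's own approach.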
mtd1 <- function() {
mDT = CJ(id = unique(DT1$id), days_ago = c(50L, 150L, 550L))
mDT[, date_dn := as.IDate(Sys.Date()) - days_ago]
res = DT1[mDT, on=.(id, date > date_dn), .(
days_ago = first(days_ago),
m = mean(x)
), by=.EACHI]
}
mtd2 <- function() {
DT2[, .(
max(x[date > window1]),
max(x[date > window2]),
max(x[date > window3])
), by = id]
}
mtd3 <- function() {
#Reduce(function(x, y) x[y, on="id"],
lapply(c(window1, window2, window3),
function(d) DT3[date > d, .(max(x)), by = id, verbose=T])
#)
}
library(microbenchmark)
microbenchmark(mtd1(), mtd2(), mtd3(), times=1L)
mtd3() prints:
i clause present and columns used in by detected, only these subset: id
Detected that j uses these columns: x
Finding groups using forderv ... 0.000sec
Finding group sizes from the positions (can be avoided to save RAM) ... 0.000sec
lapply optimization is on, j unchanged as 'list(max(x))'
GForce optimized j to 'list(gmax(x))'
Making each group and running j (GForce TRUE) ... 0.000sec
i clause present and columns used in by detected, only these subset: id
Detected that j uses these columns: x
Finding groups using forderv ... 0.000sec
Finding group sizes from the positions (can be avoided to save RAM) ... 0.000sec
lapply optimization is on, j unchanged as 'list(max(x))'
GForce optimized j to 'list(gmax(x))'
Making each group and running j (GForce TRUE) ... 0.030sec
i clause present and columns used in by detected, only these subset: id
Detected that j uses these columns: x
Finding groups using forderv ... 0.000sec
Finding group sizes from the positions (can be avoided to save RAM) ... 0.000sec
lapply optimization is on, j unchanged as 'list(max(x))'
GForce optimized j to 'list(gmax(x))'
Making each group and running j (GForce TRUE) ... 0.080sec
Timings:
Unit: milliseconds
expr min lq mean median uq max neval
mtd1() 323.3229 323.3229 323.3229 323.3229 323.3229 323.3229 1
mtd2() 249.8188 249.8188 249.8188 249.8188 249.8188 249.8188 1
mtd3() 479.5279 479.5279 479.5279 479.5279 479.5279 479.5279 1
Data:
library(data.table)
n <- 1e7
m <- 10
DT = data.table(
id=sample(1:m, n/m, replace=TRUE),
date=sample(seq(Sys.Date(), by="-1 day", length.out=1000), n, replace=TRUE),
x=runif(n))
window1 <- Sys.Date() - 50
window2 <- Sys.Date() - 150
window3 <- Sys.Date() - 550
DT[, date := as.IDate(date)]
setorder(DT, id, date)
DT1 <- copy(DT)
DT2 <- copy(DT)
DT3 <- copy(DT)
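mtd3 returns a list with one table per window. A possible way to combine those pieces into a single table, sketched on made-up data (`toy`, `today`, and the `w50`/`w150`/`w550` labels are assumptions for illustration), is to stack them with `rbindlist` and reshape with `dcast`:

```r
library(data.table)

set.seed(7)
today <- as.IDate("2018-01-01")  # fixed reference date (assumption)
toy <- data.table(id   = rep(1:3, each = 600),
                  date = today - rep(0:599, times = 3),
                  x    = runif(1800))

cutoffs <- c(w50 = 50L, w150 = 150L, w550 = 550L)  # hypothetical window labels

# One grouped query per window; the filter sits in i, so j stays a plain max(x)
per_window <- lapply(cutoffs, function(d) {
  toy[date > today - d, .(m = max(x)), by = id]
})

# Stack the pieces, keeping the list names in a "window" column, then go wide
long <- rbindlist(per_window, idcol = "window")
wide <- dcast(long, id ~ window, value.var = "m")
setcolorder(wide, c("id", names(cutoffs)))  # dcast orders the new columns alphabetically
wide
```

The extra stacking and reshaping adds work on top of the three grouped queries, which fits the observation above that this route is not faster overall.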