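I'm trying to apply my function group by group in a data.table and I'm running into problems. I'm not sure whether I should change my function or whether my call is wrong. Here is a simple example:
Data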
library(data.table)  # data.table(), :=, shift()
library(zoo)         # rollapply()
test <- data.table(return=c(0.1, 0.1, 0.1, 0.1, 0.1, 0.2, 0.2, 0.2, 0.2, 0.2),
sec=c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B"))
My function
zoo_fun <- function(dt, N) {
(rollapply(dt$return + 1, N, FUN=prod, fill=NA, align='right') - 1)
}
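Running it (I want to create a new column, momentum, holding for each security the product of the latest 3 observations, each increased by one; hence the grouping by=sec).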
test[, momentum3 := zoo_fun(test, 3), by=sec]
Warning messages:
1: In `[.data.table`(test, , `:=`(momentum3, zoo_fun(test, 3)), by = sec) :
RHS 1 is length 10 (greater than the size (5) of group 1). The last 5 element(s) will be discarded.
2: In `[.data.table`(test, , `:=`(momentum3, zoo_fun(test, 3)), by = sec) :
RHS 1 is length 10 (greater than the size (5) of group 2). The last 5 element(s) will be discarded.
I get warnings, and the result is not what I expected:
> test
return sec momentum3
1: 0.1 A NA
2: 0.1 A NA
3: 0.1 A 0.331
4: 0.1 A 0.331
5: 0.1 A 0.331
6: 0.2 B NA
7: 0.2 B NA
8: 0.2 B 0.331
9: 0.2 B 0.331
10: 0.2 B 0.331
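I was expecting sec B to be filled with 0.728 ((1.2 * 1.2 * 1.2) - 1), with two NAs at the start. What am I doing wrong? Do rolling functions not work with grouping?
Answer 0: (score: 4)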
A linked answer suggests using Reduce() and shift() to solve rolling-window problems with data.table. That answer indicates this can be considerably faster than zoo::rollapply().
test[, momentum := Reduce(`*`, shift(return + 1.0, 0:2, type="lag")) - 1, by = sec][]
# return sec momentum
# 1: 0.1 A NA
# 2: 0.1 A NA
# 3: 0.1 A 0.331
# 4: 0.1 A 0.331
# 5: 0.1 A 0.331
# 6: 0.2 B NA
# 7: 0.2 B NA
# 8: 0.2 B 0.728
# 9: 0.2 B 0.728
#10: 0.2 B 0.728
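To see why the Reduce() / shift() combination yields a rolling product, it can help to look at the intermediate pieces on a plain vector. The following is a minimal sketch that assumes only that data.table is installed; the vector x and the names lags and momentum are made up for illustration.

```r
library(data.table)

# Illustrative vector (an assumption, not the question's table): 1 + return
x <- c(0.1, 0.1, 0.1, 0.1, 0.1) + 1.0

# shift() with n = 0:2 returns a list of three vectors:
# the series itself, the series lagged by 1, and lagged by 2 (padded with NA)
lags <- shift(x, 0:2, type = "lag")
str(lags)

# Reduce(`*`, ...) multiplies the three vectors element by element,
# so position i holds x[i] * x[i-1] * x[i-2]
momentum <- Reduce(`*`, lags) - 1
momentum
```

The first two entries come out as NA because the lagged vectors have no values there, which mirrors fill=NA with align='right' in rollapply().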
microbenchmark::microbenchmark(
zoo = test[, momentum := zoo_fun(return, 3), by = sec][],
red = test[, momentum := Reduce(`*`, shift(return + 1.0, 0:2, type="lag")) - 1, by = sec][],
times = 100L
)
#Unit: microseconds
# expr min lq mean median uq max neval cld
# zoo 2318.209 2389.131 2445.1707 2421.541 2466.1930 3108.382 100 b
# red 562.465 625.413 663.4893 646.880 673.4715 1094.771 100 a
To verify the benchmark results obtained with the small data set, a larger data set is constructed:
n_rows <- 1e4
test0 <- data.table(return = rep(as.vector(outer(1:5/100, 1:2/10, "+")), n_rows),
sec = rep(rep(c("A", "B"), each = 5L), n_rows))
test0
# return sec
# 1: 0.11 A
# 2: 0.12 A
# 3: 0.13 A
# 4: 0.14 A
# 5: 0.15 A
# ---
# 99996: 0.21 B
# 99997: 0.22 B
# 99998: 0.23 B
# 99999: 0.24 B
#100000: 0.25 B
As test is modified in place, each benchmark run starts with a fresh copy of test0.
microbenchmark::microbenchmark(
copy = test <- copy(test0),
zoo = {
test <- copy(test0)
test[, momentum := zoo_fun(return, 3), by = sec][]
},
red = {
test <- copy(test0)
test[, momentum := Reduce(`*`, shift(return + 1.0, 0:2, type="lag")) - 1, by = sec][]
},
times = 10L
)
#Unit: microseconds
# expr min lq mean median uq max neval cld
# copy 282.619 294.512 325.3261 298.424 350.272 414.983 10 a
# zoo 1129601.974 1144346.463 1188484.0653 1162598.499 1194430.395 1337727.279 10 b
# red 3354.554 3439.095 6135.8794 5002.008 7695.948 11443.595 10 a
For 100k rows, the Reduce() / shift() approach is more than 200 times faster than zoo::rollapply() (comparing median timings).
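The speed-up factor can be recomputed from the median timings printed in the benchmark output above. This is just a small arithmetic sketch; the variable names are made up, and subtracting the copy() overhead is an extra step that the original answer does not perform.

```r
# Median timings in microseconds, copied from the benchmark output above
med_copy <- 298.424
med_zoo  <- 1162598.499
med_red  <- 5002.008

# Raw ratio of the medians
ratio_raw <- med_zoo / med_red

# Ratio after subtracting the median cost of copy(test0) from both timings
ratio_net <- (med_zoo - med_copy) / (med_red - med_copy)

round(c(raw = ratio_raw, net = ratio_net), 1)
```

Either way the factor stays above 200, so the cost of copying test0 does not change the conclusion.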
Apparently, there are different interpretations of the expected result.
To investigate this, a modified data set is used:
test <- data.table(return=c(0.1, 0.11, 0.12, 0.13, 0.14, 0.21, 0.22, 0.23, 0.24, 0.25),
sec=c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B"))
test
# return sec
# 1: 0.10 A
# 2: 0.11 A
# 3: 0.12 A
# 4: 0.13 A
# 5: 0.14 A
# 6: 0.21 B
# 7: 0.22 B
# 8: 0.23 B
# 9: 0.24 B
#10: 0.25 B
Note that the return values within each group vary, unlike in the OP's data set, where the return values within each sec group are constant.
With this, the rollapply() approach returns:
test[, momentum := zoo_fun(return, 3), by = sec][]
# return sec momentum
# 1: 0.10 A NA
# 2: 0.11 A NA
# 3: 0.12 A 0.367520
# 4: 0.13 A 0.404816
# 5: 0.14 A 0.442784
# 6: 0.21 B NA
# 7: 0.22 B NA
# 8: 0.23 B 0.815726
# 9: 0.24 B 0.860744
#10: 0.25 B 0.906500
An alternative that computes a single product over only the last 3 rows of each group returns:
test[test[ , tail(.I, 3), by = sec]$V1, res := prod(return + 1) - 1, by = sec][]
# return sec res
# 1: 0.10 A NA
# 2: 0.11 A NA
# 3: 0.12 A 0.442784
# 4: 0.13 A 0.442784
# 5: 0.14 A 0.442784
# 6: 0.21 B NA
# 7: 0.22 B NA
# 8: 0.23 B 0.906500
# 9: 0.24 B 0.906500
#10: 0.25 B 0.906500
The Reduce() / shift() solution returns:
test[, momentum := Reduce(`*`, shift(return + 1.0, 0:2, type="lag")) - 1, by = sec][]
# return sec momentum
# 1: 0.10 A NA
# 2: 0.11 A NA
# 3: 0.12 A 0.367520
# 4: 0.13 A 0.404816
# 5: 0.14 A 0.442784
# 6: 0.21 B NA
# 7: 0.22 B NA
# 8: 0.23 B 0.815726
# 9: 0.24 B 0.860744
#10: 0.25 B 0.906500
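The agreement between the rollapply() and Reduce() / shift() results can also be checked programmatically rather than by eye. The sketch below is one possible way to do it; the helper roll_prod_shift() and the column names mom_shift and mom_zoo are hypothetical names introduced here, and the helper assumes a window of at least 2, because shift() returns a plain vector rather than a list when given a single lag.

```r
library(data.table)
library(zoo)

# Hypothetical helper (the name is an assumption): rolling product of (1 + x)
# over the last n observations, minus 1, built from shift() and Reduce().
# Assumes n >= 2.
roll_prod_shift <- function(x, n) {
  Reduce(`*`, shift(x + 1.0, 0:(n - 1L), type = "lag")) - 1
}

# The modified data set from above
test <- data.table(
  return = c(0.1, 0.11, 0.12, 0.13, 0.14, 0.21, 0.22, 0.23, 0.24, 0.25),
  sec    = rep(c("A", "B"), each = 5L)
)

test[, mom_shift := roll_prod_shift(return, 3L), by = sec]
test[, mom_zoo := rollapply(return + 1, 3, FUN = prod, fill = NA, align = "right") - 1,
     by = sec]

# TRUE if both columns agree up to floating-point tolerance (NAs included)
all.equal(test$mom_shift, test$mom_zoo)
```

Parameterising the window this way also avoids hard-coding 0:2 when a different look-back length is needed.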
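Answer 1: (score: 3)
When you call zoo_fun(test, 3) and use dt$return inside the function, the whole column of the data.table (all 10 rows) is picked up within every group rather than just the group's rows. Simply use the column you need in the function definition and it works fine: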
#use the column instead of the data.table
zoo_fun <- function(column, N) {
(rollapply(column + 1, N, FUN=prod, fill=NA, align='right') - 1)
}
#now it works fine
test[, momentum := zoo_fun(return, 3), by = sec]
As a separate note, you probably should not use return as a column or variable name.
Output:
> test
return sec momentum
1: 0.1 A NA
2: 0.1 A NA
3: 0.1 A 0.331
4: 0.1 A 0.331
5: 0.1 A 0.331
6: 0.2 B NA
7: 0.2 B NA
8: 0.2 B 0.728
9: 0.2 B 0.728
10: 0.2 B 0.728
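The difference between the bare column name and the $ lookup inside a grouped call can be made visible by comparing lengths. This is a small illustrative sketch; the result name lens and its column names are made up for the example.

```r
library(data.table)

test <- data.table(
  return = c(0.1, 0.1, 0.1, 0.1, 0.1, 0.2, 0.2, 0.2, 0.2, 0.2),
  sec    = rep(c("A", "B"), each = 5L)
)

# Inside j with by = sec, the bare column name is the current group's subset,
# whereas test$return reaches outside the grouping and returns all rows
lens <- test[, .(n_in_group = length(return), n_via_dollar = length(test$return)),
             by = sec]
lens
```

Each group sees 5 values through the bare column name but 10 through test$return, which is where the "RHS 1 is length 10" warning in the question comes from.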