Rolling function by group in data.table (R)

Time: 2017-05-17 09:44:03

Tags: r data.table grouping

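I am trying to apply my function group by group in a data.table and am running into a problem. I am not sure whether I should change my function or whether my call is wrong. Here is a simple example:

Data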

library(data.table)
library(zoo)  # provides rollapply()
test <- data.table(return=c(0.1, 0.1, 0.1, 0.1, 0.1, 0.2, 0.2, 0.2, 0.2, 0.2),
                   sec=c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B"))

My function

zoo_fun <- function(dt, N) {
  (rollapply(dt$return + 1, N, FUN=prod, fill=NA, align='right') - 1)
}

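Running it (I want to create a new column, momentum, which for each security is the product of the latest 3 observations, each with one added, minus one; hence the grouping by sec):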

test[, momentum3 := zoo_fun(test, 3), by=sec]

    Warning messages:
    1: In `[.data.table`(test, , `:=`(momentum3, zoo_fun(test, 3)), by = sec) :
      RHS 1 is length 10 (greater than the size (5) of group 1). The last 5 element(s) will be discarded.
    2: In `[.data.table`(test, , `:=`(momentum3, zoo_fun(test, 3)), by = sec) :
      RHS 1 is length 10 (greater than the size (5) of group 2). The last 5 element(s) will be discarded.

I get warnings, and the result is not what I expect:

> test
    return sec momentum3
 1:    0.1   A        NA
 2:    0.1   A        NA
 3:    0.1   A     0.331
 4:    0.1   A     0.331
 5:    0.1   A     0.331
 6:    0.2   B        NA
 7:    0.2   B        NA
 8:    0.2   B     0.331
 9:    0.2   B     0.331
10:    0.2   B     0.331

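I expected sec B to be filled with 0.728 ((1.2 * 1.2 * 1.2) - 1), with two NAs at the start. What am I doing wrong? Do rolling functions not work with grouping?

2 answers:

Answer 0 (score: 4)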

Another answer suggests using Reduce() and shift() for rolling-window problems with data.table. A benchmark indicates that this can be considerably faster than zoo::rollapply().

test[, momentum := Reduce(`*`, shift(return + 1.0, 0:2, type="lag")) - 1, by = sec][]
#    return sec momentum
# 1:    0.1   A       NA
# 2:    0.1   A       NA
# 3:    0.1   A    0.331
# 4:    0.1   A    0.331
# 5:    0.1   A    0.331
# 6:    0.2   B       NA
# 7:    0.2   B       NA
# 8:    0.2   B    0.728
# 9:    0.2   B    0.728
#10:    0.2   B    0.728
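To make the mechanics of the one-liner above easier to follow, here is a minimal sketch on a plain vector. The vector x and the names lags and momentum are illustrative only and are not part of the original answer.

```r
library(data.table)

# Illustrative values only, not the OP's data
x <- c(0.1, 0.1, 0.1, 0.2, 0.2)

# With a vector of lags, shift() returns a list with one element per lag:
# lag 0 (the input itself), lag 1 and lag 2, padded with NA at the start
lags <- shift(x + 1, 0:2, type = "lag")
str(lags)

# Reduce(`*`, ...) multiplies the list elements position by position,
# which yields the product over a window of the current and two previous values
momentum <- Reduce(`*`, lags) - 1
momentum
```

Because the first two positions have no complete window, the NA padding from shift() carries through the multiplication, which is where the two leading NA values per group come from.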

Benchmark (10 rows, OP's data set)

microbenchmark::microbenchmark(
  zoo = test[, momentum := zoo_fun(return, 3), by = sec][],
  red  = test[, momentum := Reduce(`*`, shift(return + 1.0, 0:2, type="lag")) - 1, by = sec][],
  times = 100L
)
#Unit: microseconds
# expr      min       lq      mean   median        uq      max neval cld
#  zoo 2318.209 2389.131 2445.1707 2421.541 2466.1930 3108.382   100   b
#  red  562.465  625.413  663.4893  646.880  673.4715 1094.771   100  a 

Benchmark (100k rows)

To verify the benchmark results obtained with the small data set, a larger data set is constructed:

n_rows <- 1e4
test0 <- data.table(return = rep(as.vector(outer(1:5/100, 1:2/10, "+")), n_rows),
                   sec = rep(rep(c("A", "B"), each = 5L), n_rows))

test0
#        return sec
#     1:   0.11   A
#     2:   0.12   A
#     3:   0.13   A
#     4:   0.14   A
#     5:   0.15   A
#    ---           
# 99996:   0.21   B
# 99997:   0.22   B
# 99998:   0.23   B
# 99999:   0.24   B
#100000:   0.25   B
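The outer() call used to build the return column may be easier to read in isolation. The following sketch only reproduces the 10-value pattern that is then repeated; the name base_pattern is introduced here for illustration.

```r
# (1:5)/100 gives 0.01 ... 0.05 and (1:2)/10 gives 0.1, 0.2;
# outer(..., "+") forms all pairwise sums as a 5 x 2 matrix
base_pattern <- as.vector(outer(1:5 / 100, 1:2 / 10, "+"))

# as.vector() reads the matrix column by column, so the first five
# values belong to the 0.1 block and the last five to the 0.2 block
base_pattern

# Repeating the pattern 1e4 times gives the 100k rows used in the benchmark
length(rep(base_pattern, 1e4))
```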

Because test is modified in place, each benchmark run starts with a fresh copy of test0.

microbenchmark::microbenchmark(
  copy = test <- copy(test0),
  zoo  = {
    test <- copy(test0)
    test[, momentum := zoo_fun(return, 3), by = sec][]
  },
  red  = {
    test <- copy(test0)
    test[, momentum := Reduce(`*`, shift(return + 1.0, 0:2, type="lag")) - 1, by = sec][]
  },
  times = 10L
)

#Unit: microseconds
# expr         min          lq         mean      median          uq         max neval cld
# copy     282.619     294.512     325.3261     298.424     350.272     414.983    10  a 
#  zoo 1129601.974 1144346.463 1188484.0653 1162598.499 1194430.395 1337727.279    10   b
#  red    3354.554    3439.095    6135.8794    5002.008    7695.948   11443.595    10  a 

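For 100k rows, the Reduce() / shift() approach is more than 200 times faster than zoo::rollapply() (comparing median timings).

Apparently, there are different interpretations of what the expected result should be.

To investigate this, a modified data set is used: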

test <- data.table(return=c(0.1, 0.11, 0.12, 0.13, 0.14, 0.21, 0.22, 0.23, 0.24, 0.25),
                   sec=c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B"))
test
#    return sec
# 1:   0.10   A
# 2:   0.11   A
# 3:   0.12   A
# 4:   0.13   A
# 5:   0.14   A
# 6:   0.21   B
# 7:   0.22   B
# 8:   0.23   B
# 9:   0.24   B
#10:   0.25   B

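Note that the return values vary within each group, unlike the OP's data set, where the return values are constant within each sec group.

With this data, the accepted answer (rollapply()) returns: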

test[, momentum := zoo_fun(return, 3), by = sec][]
#    return sec momentum
# 1:   0.10   A       NA
# 2:   0.11   A       NA
# 3:   0.12   A 0.367520
# 4:   0.13   A 0.404816
# 5:   0.14   A 0.442784
# 6:   0.21   B       NA
# 7:   0.22   B       NA
# 8:   0.23   B 0.815726
# 9:   0.24   B 0.860744
#10:   0.25   B 0.906500

Another proposed solution, which computes a single product over the last three rows of each group, returns:

test[test[ , tail(.I, 3), by = sec]$V1, res := prod(return + 1) - 1, by = sec][]
#    return sec      res
# 1:   0.10   A       NA
# 2:   0.11   A       NA
# 3:   0.12   A 0.442784
# 4:   0.13   A 0.442784
# 5:   0.14   A 0.442784
# 6:   0.21   B       NA
# 7:   0.22   B       NA
# 8:   0.23   B 0.906500
# 9:   0.24   B 0.906500
#10:   0.25   B 0.906500

The Reduce() / shift() solution returns:

test[, momentum := Reduce(`*`, shift(return + 1.0, 0:2, type="lag")) - 1, by = sec][]
#    return sec momentum
# 1:   0.10   A       NA
# 2:   0.11   A       NA
# 3:   0.12   A 0.367520
# 4:   0.13   A 0.404816
# 5:   0.14   A 0.442784
# 6:   0.21   B       NA
# 7:   0.22   B       NA
# 8:   0.23   B 0.815726
# 9:   0.24   B 0.860744
#10:   0.25   B 0.906500
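The agreement between the rollapply() and Reduce() / shift() results can also be checked programmatically instead of by eye. This is a small sketch under the assumption that both packages are installed; the column names ret, m_zoo and m_red are chosen here for illustration.

```r
library(data.table)
library(zoo)

# Same values as the modified data set above; the column is named ret
# here to avoid clashing with the base function return()
dt <- data.table(ret = c(0.10, 0.11, 0.12, 0.13, 0.14, 0.21, 0.22, 0.23, 0.24, 0.25),
                 sec = rep(c("A", "B"), each = 5L))

# Rolling product via zoo::rollapply(), computed within each group
dt[, m_zoo := rollapply(ret + 1, 3, FUN = prod, fill = NA, align = "right") - 1, by = sec]

# Rolling product via Reduce() / shift(), computed within each group
dt[, m_red := Reduce(`*`, shift(ret + 1, 0:2, type = "lag")) - 1, by = sec]

# Compare the two columns up to floating-point tolerance
all.equal(dt$m_zoo, dt$m_red)
```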

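Answer 1 (score: 3)

When you pass test to the function and use dt$return, the whole column of the full data.table (all 10 rows) is picked up inside every group, regardless of by. Just have the function take the column you need, and it works fine: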

#use the column instead of the data.table
zoo_fun <- function(column, N) {
  (rollapply(column + 1, N, FUN=prod, fill=NA, align='right') - 1)
}

#now it works fine
test[, momentum := zoo_fun(return, 3), by = sec]

As a side note, you should probably not use return as a column or variable name.

Output:

> test
    return sec momentum
 1:    0.1   A       NA
 2:    0.1   A       NA
 3:    0.1   A    0.331
 4:    0.1   A    0.331
 5:    0.1   A    0.331
 6:    0.2   B       NA
 7:    0.2   B       NA
 8:    0.2   B    0.728
 9:    0.2   B    0.728
10:    0.2   B    0.728
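The length mismatch reported in the original warnings can be reproduced directly. The sketch below, with the illustrative names dt, ret and lens, compares the length of the column taken from the whole table with the length of the column as seen inside each group.

```r
library(data.table)

dt <- data.table(ret = c(0.1, 0.1, 0.1, 0.1, 0.1, 0.2, 0.2, 0.2, 0.2, 0.2),
                 sec = rep(c("A", "B"), each = 5L))

# n_whole: dt$ret refers to the full table and ignores the grouping
# n_group: the bare column name refers only to the rows of the current group
lens <- dt[, .(n_whole = length(dt$ret), n_group = length(ret)), by = sec]
lens
```

A right-hand side built from the whole column is longer than the group it is assigned to, which is consistent with the "RHS 1 is length 10 (greater than the size (5) of group ...)" warnings in the question.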