Recursive assignment in data.table

Date: 2014-10-04 11:12:18

Tags: r data.table

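Is it possible to do a recursive assignment of multiple columns in data.table? By recursive I mean that the next assignment depends on the previous one: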

library(data.table)
DT = data.table(id=rep(LETTERS[1:4], each=2), val=1:8)
DT[, c("cumsum", "cumsumofcumsum"):=list(cumsum(val), cumsum(cumsum)), by=id]

# Error in `[.data.table`(DT, , `:=`(c("cumsum", "cumsumofcumsum"), list(cumsum(val),  : 
#   cannot coerce type 'builtin' to vector of type 'double'
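
For context, the error appears to arise because the column `cumsum` does not exist yet while the right-hand side is evaluated, so the symbol `cumsum` inside `cumsum(cumsum)` resolves to the base R function instead of a column. A minimal sketch outside data.table that reproduces the same kind of failure (the variable names `fn_type` and `msg` are only for illustration):

```r
# In base R, `cumsum` is a primitive function of type "builtin".
fn_type <- typeof(cumsum)

# Passing the function itself as the argument fails, because a builtin
# cannot be coerced to a numeric vector. Capture the message instead of stopping.
msg <- tryCatch(
  cumsum(cumsum),
  error = function(e) conditionMessage(e)
)

print(fn_type)
print(msg)
```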

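Of course, the assignments can be done separately, but I'm guessing the overhead (e.g. the grouping) is then not shared between the operations: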

DT = data.table(id=rep(LETTERS[1:4], each=2), val=1:8)
DT[, c("cumsum"):=cumsum(val), by=id]
DT[, c("cumsumofcumsum"):=cumsum(cumsum), by=id]
DT
#    id val cumsum cumsumofcumsum
# 1:  A   1      1              1
# 2:  A   2      3              4
# 3:  B   3      3              3
# 4:  B   4      7             10
# 5:  C   5      5              5
# 6:  C   6     11             16
# 7:  D   7      7              7
# 8:  D   8     15             22

1 answer:

Answer 0: (score: 6)

You can store the first result in a temporary variable and reuse it to compute the second column:

DT[, c("cumsum", "cumsumofcumsum"):={
              x <- cumsum(val)
              list(x, cumsum(x))
              }, by=id]
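
As a quick sanity check, the single-call version can be compared with the two-step version from the question. This is a sketch on the same toy data; the names `DT1`, `DT2` and `same` are only for illustration:

```r
library(data.table)

# Two-step version: one grouped := call per column.
DT1 <- data.table(id = rep(LETTERS[1:4], each = 2), val = 1:8)
DT1[, cumsum := cumsum(val), by = id]
DT1[, cumsumofcumsum := cumsum(cumsum), by = id]

# Single-call version: the temporary variable x lives only inside the
# braces and is reused for the second column within the same grouping pass.
DT2 <- data.table(id = rep(LETTERS[1:4], each = 2), val = 1:8)
DT2[, c("cumsum", "cumsumofcumsum") := {
  x <- cumsum(val)
  list(x, cumsum(x))
}, by = id]

# Both tables should hold the same columns and values.
same <- isTRUE(all.equal(DT1, DT2))
print(same)
```

The braces form an ordinary R expression, so any number of intermediate variables can be defined there as long as the last value is a list with one element per assigned column.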

You can also use dplyr with data.table as the backend, but I'm not sure you will get the same performance as with the pure data.table approach:

library(dplyr)
DT %>%
  group_by(id) %>%
  mutate(
    cum1 = cumsum(val),
    cum2 = cumsum(cum1)
  )

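EDIT: adding some benchmarks:

The pure data.table solution is roughly 5 times faster than the dplyr solution (median of about 3.3 versus 15.9 in the timings below). I suspect that the extra work dplyr does behind the scenes explains the difference.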

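# data.table version: both columns assigned in a single grouped call
f_dt <- function() {
  DT[, c("cumsum", "cumsumofcumsum") := {
    x <- as.numeric(cumsum(val))
    list(x, cumsum(x))
  }, by = id]
}

# dplyr version: mutate() can refer to cum1 when computing cum2
f_dplyr <- function() {
  DT %>%
    group_by(id) %>%
    mutate(
      cum1 = as.numeric(cumsum(val)),
      cum2 = cumsum(cum1)
    )
}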


library(microbenchmark)

microbenchmark(f_dt(),f_dplyr(),times = 100)
    expr       min       lq    median        uq       max neval
    f_dt()  2.580121  2.97114  3.256156  4.318658  13.49149   100
 f_dplyr() 10.792662 14.09490 15.909856 19.593819 159.80626   100
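
The question also wondered whether two separate `:=` calls pay the grouping overhead twice. The timings above do not measure that, so here is a rough sketch of how it could be checked. The table `big`, its size, the column names `cs` and `cs2`, and the helper functions are all assumptions made for illustration. Each run works on a `copy()` so that `:=` never modifies the original table, and no timings are quoted because they depend on the machine and package versions:

```r
library(data.table)
library(microbenchmark)

set.seed(1)
# Hypothetical larger table: 1e5 rows spread over 26 groups.
big <- data.table(
  id  = sample(LETTERS, 1e5, replace = TRUE),
  val = sample(1:10, 1e5, replace = TRUE)
)

# Both columns in one grouped call, reusing the temporary variable x.
one_call <- function(dt) {
  dt[, c("cs", "cs2") := {
    x <- as.numeric(cumsum(val))
    list(x, cumsum(x))
  }, by = id]
}

# One grouped call per column, as in the question.
two_calls <- function(dt) {
  dt[, cs := as.numeric(cumsum(val)), by = id]
  dt[, cs2 := cumsum(cs), by = id]
}

# Confirm both variants give the same result before timing them.
same <- isTRUE(all.equal(one_call(copy(big)), two_calls(copy(big))))

res <- microbenchmark(
  one_call(copy(big)),
  two_calls(copy(big)),
  times = 20
)
print(res)
```

Note that the cost of `copy()` is included in both timings, so only the difference between the two rows is informative.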