Is it possible to do recursive assignment of multiple columns in data.table? By recursive I mean that a later assignment depends on an earlier one:
library(data.table)
DT = data.table(id=rep(LETTERS[1:4], each=2), val=1:8)
DT[, c("cumsum", "cumsumofcumsum"):=list(cumsum(val), cumsum(cumsum)), by=id]
# Error in `[.data.table`(DT, , `:=`(c("cumsum", "cumsumofcumsum"), list(cumsum(val), :
# cannot coerce type 'builtin' to vector of type 'double'
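Of course, the assignments can be done separately, but I suspect the overhead (e.g. the grouping) is then not shared between the operations: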
DT = data.table(id=rep(LETTERS[1:4], each=2), val=1:8)
DT[, c("cumsum"):=cumsum(val), by=id]
DT[, c("cumsumofcumsum"):=cumsum(cumsum), by=id]
DT
#    id val cumsum cumsumofcumsum
# 1:  A   1      1              1
# 2:  A   2      3              4
# 3:  B   3      3              3
# 4:  B   4      7             10
# 5:  C   5      5              5
# 6:  C   6     11             16
# 7:  D   7      7              7
# 8:  D   8     15             22
Answer 0 (score: 6)
You can compute the first column into a temporary variable and reuse it for the second column:
DT[, c("cumsum", "cumsumofcumsum") := {
  x <- cumsum(val)
  list(x, cumsum(x))
}, by=id]
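As a quick sanity check, the sketch below compares the single grouped call above with the two separate calls from the question on the same toy data. The names DT1, DT2 and same are only for illustration and are not part of the original post; copy() is used because := modifies a data.table by reference.

```r
library(data.table)

# Same toy data as in the question
DT1 <- data.table(id = rep(LETTERS[1:4], each = 2), val = 1:8)
DT2 <- copy(DT1)  # independent copy, since := modifies by reference

# One grouped call: compute the first column once, reuse it for the second
DT1[, c("cumsum", "cumsumofcumsum") := {
  x <- cumsum(val)
  list(x, cumsum(x))
}, by = id]

# Two grouped calls, as in the question
DT2[, cumsum := cumsum(val), by = id]
DT2[, cumsumofcumsum := cumsum(cumsum), by = id]

# Compare the two results (hypothetical helper variable)
same <- isTRUE(all.equal(DT1, DT2))
print(same)
```

If the two approaches agree, the only difference is that the first one groups the data once instead of twice.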
Of course, you can also use dplyr with data.table as the backend, but I am not sure you will get the same performance as with the pure data.table approach:
library(dplyr)
DT %>%
  group_by(id) %>%
  mutate(
    cum1 = cumsum(val),
    cum2 = cumsum(cum1)
  )
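The pure data.table solution is about 5 times faster than the dplyr solution (median of roughly 3.3 versus 15.9 in the benchmark below). I guess what dplyr does behind the scenes explains the difference.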
f_dt <- function() {
  DT[, c("cumsum", "cumsumofcumsum") := {
    x <- as.numeric(cumsum(val))
    list(x, cumsum(x))
  }, by=id]
}
f_dplyr <- function() {
  DT %>%
    group_by(id) %>%
    mutate(
      cum1 = as.numeric(cumsum(val)),
      cum2 = cumsum(cum1)
    )
}
library(microbenchmark)
microbenchmark(f_dt(),f_dplyr(),times = 100)
     expr       min       lq    median        uq       max neval
   f_dt()  2.580121  2.97114  3.256156  4.318658  13.49149   100
f_dplyr() 10.792662 14.09490 15.909856 19.593819 159.80626   100