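I've just started learning data.table and have begun working through the vignettes, though I'm already using it in a project at the same time. How do I replace some plyr syntax with data.table?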
library(data.table)

input <- data.table(ID = c(37, 45, 900), a1 = c(1, 2, 3), a2 = c(43, 320, 390),
                    b1 = c(-0.94, 2.2, -1.223), b2 = c(2.32, 4.54, 7.21),
                    c1 = c(1, 2, 3), c2 = c(-0.94, 2.2, -1.223))

# simple user defined function that conveys my problem
func <- function(x, num) {
  x <- data.table(x)
  new_b <- x$b1[1]
  # take the first row of the group and overwrite b1 and b2
  x2 <- within(x[1,], {
    b1 = new_b
    b2 = 51
  })
  # append num copies of the modified row to the group
  imp <- rbindlist(replicate(num, x2, simplify= FALSE))
  return(rbindlist(list(x, imp)))
}

# wrapper function
wrap_func <- function(dat, num= 5, plyr= FALSE) {
  if (plyr == TRUE) {
    return(plyr::ddply(dat, .var= "ID", .fun= func, num= num))
  } else {
    return(dat[, lapply(.SD, FUN= func, num), by= ID])
  }
}
The plyr version works fine:
wrap_func(dat=input, 5, plyr=TRUE)
What is the data.table syntax?
wrap_func(dat=input, num=5, plyr=FALSE) # gives error
Thanks in advance!!
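One form that seems to work on the toy example is to call func once per group on .SD instead of wrapping it in lapply. lapply(.SD, ...) iterates over the columns of .SD, so func receives a single column vector rather than the group's rows, and x$b1 no longer exists. The sketch below repeats input and func so it runs on its own. It assumes the grouping column ID does not need to be visible inside func (by default .SD excludes it), and res is only an illustrative name.

```r
library(data.table)

input <- data.table(ID = c(37, 45, 900), a1 = c(1, 2, 3), a2 = c(43, 320, 390),
                    b1 = c(-0.94, 2.2, -1.223), b2 = c(2.32, 4.54, 7.21),
                    c1 = c(1, 2, 3), c2 = c(-0.94, 2.2, -1.223))

# same function as in the question
func <- function(x, num) {
  x <- data.table(x)
  new_b <- x$b1[1]
  x2 <- within(x[1,], {
    b1 = new_b
    b2 = 51
  })
  imp <- rbindlist(replicate(num, x2, simplify = FALSE))
  rbindlist(list(x, imp))
}

# pass the whole per-group subset (.SD) to func once per ID;
# data.table stacks the returned tables and prepends the ID column
res <- input[, func(.SD, num = 5), by = ID]

nrow(res)  # 3 groups x (1 original row + 5 imputed rows) = 18
```

Inside wrap_func, the else branch would then read dat[, func(.SD, num), by= ID], under the same assumptions.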
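Based on @Frank's suggestion in the comments, I benchmarked this against my real data/code. Here, impute_zero_resp_all is effectively equivalent to wrap_func in the example.
I start with a dataset of ~50k rows and 1800 groups; the imputation is done by group and produces a dataset of about 170k rows with the same 1800 groups: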
vec1 <- vec2 <- vector(mode= "numeric", length= 50)
for (i in 1:50) {
vec1[i] <- system.time(impute_zero_resp_all(dat= test_dat2))[3] #DT
vec2[i] <- system.time(impute_zero_resp_all2(dat= test_dat2))[3] #PLYR
}
summary(vec1); summary(vec2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  22.62   22.76   22.81   22.84   22.84   23.72
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  27.19   27.35   27.40   27.49   27.45   30.07
quantile(vec1, seq(0,1,.1))
    0%    10%    20%    30%    40%    50%    60%    70%    80%    90%   100%
22.620 22.670 22.728 22.760 22.786 22.810 22.824 22.840 22.870 22.917 23.720
quantile(vec2, seq(0,1,.1))
    0%    10%    20%    30%    40%    50%    60%    70%    80%    90%   100%
27.190 27.289 27.330 27.357 27.376 27.400 27.424 27.440 27.476 27.522 30.070
sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1