TL;DR: I'd like to know memory-efficient ways of performing batch predictions with a randomForest model built on a large dataset (hundreds of features, hundreds of thousands of rows).
Details:
I'm working with a large dataset (over 3 GB in memory) and want to perform a simple binary classification with randomForest. Since my data is proprietary I can't share it, but suppose the following code runs:
library(randomForest)
library(data.table)

myData <- fread("largeDataset.tsv")
myFeatures <- myData[, !c("response"), with = FALSE]   # all columns except the label
myResponse <- factor(myData[["response"]])             # a factor, so randomForest classifies rather than regresses

toBePredicted <- fread("unlabeledData.tsv")

rfObj <- randomForest(x = myFeatures, y = myResponse, ntree = 100L)
predictedLabels <- predict(rfObj, toBePredicted)
However, this takes several GB of memory.
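For the prediction step alone, I could cap memory by scoring toBePredicted in chunks. This is only a sketch (chunkSize is an arbitrary illustrative value, not from my actual pipeline), and the whole forest still has to stay in memory:

chunkSize <- 50000L  # illustrative; tune to available memory
nr <- nrow(toBePredicted)
predictedLabels <- character(nr)
for (start in seq(1L, nr, by = chunkSize)) {
  idx <- start:min(start + chunkSize - 1L, nr)
  # predict() only materializes one chunk's worth of output at a time
  predictedLabels[idx] <- as.character(predict(rfObj, toBePredicted[idx]))
}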
I also know that I can save memory by turning off a bunch of the proximity and importance measures, as well as the keep.* arguments:
rfObjWithPreds <- randomForest(x = myFeatures,
                               y = myResponse,
                               proximity = FALSE,
                               localImp = FALSE,
                               importance = FALSE,
                               ntree = 100L,
                               keep.forest = FALSE,
                               keep.inbag = FALSE,
                               xtest = toBePredicted)
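With xtest supplied and keep.forest = FALSE, the forest is discarded after training and the test-set results come back on the fitted object itself, in its test component:

testVotes <- rfObjWithPreds[["test"]][["votes"]]       # per-class vote fractions for toBePredicted
testLabels <- rfObjWithPreds[["test"]][["predicted"]]  # majority-vote labels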
But now I'm wondering whether this is the most memory-efficient way of getting predictions for toBePredicted. Another approach I could take is to grow sub-forests in parallel and garbage-collect aggressively:
library(doParallel)
registerDoParallel(cores = 5)

subForestVotes <- foreach(subForestNumber = iter(seq.int(5)),
                          .combine = cbind) %dopar% {
  rfObjWithPreds <- randomForest(x = myFeatures,
                                 y = myResponse,
                                 proximity = FALSE,
                                 localImp = FALSE,
                                 importance = FALSE,
                                 ntree = 100L,   # trees per sub-forest
                                 keep.forest = FALSE,
                                 keep.inbag = FALSE,
                                 xtest = toBePredicted)
  output <- rfObjWithPreds[["test"]][["votes"]]
  rm(rfObjWithPreds)
  gc()     # free the worker's copy of the fitted object immediately
  output   # foreach returns the last expression; return() would error here
}
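Since .combine = cbind just pastes the sub-forests' vote matrices side by side, I would then average the votes per class and take the majority. A minimal sketch, assuming the vote columns are named by the levels of myResponse (which is how randomForest names its votes matrix):

classLevels <- levels(myResponse)
avgVotes <- sapply(classLevels, function(lvl) {
  # each class level appears once per sub-forest after the cbind
  rowMeans(subForestVotes[, colnames(subForestVotes) == lvl, drop = FALSE])
})
predictedLabels <- factor(classLevels[max.col(avgVotes)], levels = classLevels)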
Does anyone have a more memory-efficient way of generating predictions for toBePredicted?