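I have a tidy dataset with no missing values and only numeric columns.
The dataset is both large and contains sensitive information, so unfortunately I can't provide a copy of it here.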
I used caret's createDataPartition to split this data into training and test sets:
idx <- createDataPartition(y = model_final$y, p = 0.6, list = FALSE )
training <- model_final[idx,]
testing <- model_final[-idx,]
x <- training[-ncol(training)]
y <- training$y
x1 <- testing[-ncol(testing)]
y1 <- testing$y
row.names(training) <- NULL
row.names(testing) <- NULL
row.names(x) <- NULL
row.names(y) <- NULL
row.names(x1) <- NULL
row.names(y1) <- NULL
I have routinely fit and refit random forest models on this data via randomForest:
rf <- randomForest(x = x, y = y, mtry = ncol(x), ntree = 1000,
corr.bias = T, do.trace = T, nPerm = 3)
I decided to see whether I could get better or faster results using train. The following model ran fine, but took about 2 hours:
rf_train <- train(y=y, x=x,
method='rf', tuneLength = 3,
trControl=trainControl(method='cv',number=10,
classProbs = TRUE)
)
I need to take an HPC approach to make this logistically feasible, so I tried:
require(doParallel)
registerDoParallel(cores = 8)
rf_train <- train(y=y, x=x,
method='parRF', tuneGrid = data.frame(mtry = 3), na.action = na.omit,
trControl=trainControl(method='cv',number=10,
classProbs = TRUE, allowParallel = TRUE)
)
But regardless of whether I use tuneLength or tuneGrid, this results in strange errors about missing values and tuning parameters:
Error in train.default(y = y, x = x, method = "parRF", tuneGrid = data.frame(mtry = 3), :
final tuning parameters could not be determined
In addition: Warning messages:
1: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
2: In train.default(y = y, x = x, method = "parRF", tuneGrid = data.frame(mtry = 3), :
missing values found in aggregated results
I say this is strange because method = "rf" produced no errors, and because I triple-checked to make sure there were no missing values.
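For reference, a check for missing values can be sketched in base R roughly as follows. This is only an illustration: x_demo and y_demo are hypothetical stand-ins for the real x and y, which cannot be shared. It also counts non-finite values such as Inf, which anyNA alone would not flag.

```r
# Hypothetical stand-ins for the real predictors and response
set.seed(1)
x_demo <- data.frame(a = rnorm(10), b = runif(10))
y_demo <- rnorm(10)

# TRUE if any NA or NaN is present anywhere in the object
x_has_na <- anyNA(x_demo)
y_has_na <- anyNA(y_demo)

# Count non-finite entries (NA, NaN, Inf, -Inf) in each predictor column
non_finite_per_col <- vapply(x_demo, function(col) sum(!is.finite(col)), integer(1))

print(x_has_na)
print(y_has_na)
print(non_finite_per_col)
```

If all of these come back clean on the real data, the "missing values" in the warning most likely refer to the resampled performance measures (as the warning text itself says) rather than to the input data.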
I get the same error even when I omit the tuning options entirely. I have also tried toggling the na.action option on and off, and changing "cv" to "repeatedcv".
I even get the same error with this ultra-simplified version:
rf_train <- train(y=y, x=x, method='parRF')
Answer 0 (score: 2)
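This seems to be caused by a bug in caret. See the answer to:
parRF on caret not working for more than one core
I just dealt with the same problem, and manually loading foreach on each new cluster seems to work.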
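A minimal sketch of that workaround, assuming an explicit PSOCK cluster is created instead of calling registerDoParallel(cores = 8). The simulated x_demo and y_demo, the 2 workers, and the small cv settings are placeholders chosen for illustration, not the original setup. Whether this clears the error probably depends on the caret version, and caret may ask to install extra packages that parRF needs.

```r
library(caret)
library(doParallel)  # also attaches foreach and parallel

# Hypothetical stand-in data; replace with the real x and y
set.seed(42)
x_demo <- data.frame(matrix(rnorm(200 * 4), ncol = 4))
y_demo <- rnorm(200)

# Create the workers explicitly so packages can be loaded on each one
cl <- makePSOCKcluster(2)

# The workaround from the answer: load foreach on every worker
clusterEvalQ(cl, library(foreach))

registerDoParallel(cl)

rf_par <- train(x = x_demo, y = y_demo,
                method = "parRF",
                tuneGrid = data.frame(mtry = 2),
                trControl = trainControl(method = "cv", number = 3,
                                         allowParallel = TRUE))

# Shut the workers down and fall back to sequential execution
stopCluster(cl)
registerDoSEQ()

print(rf_par)
```

For the real data, the number of workers and the resampling settings would be set back to the values used in the question.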