Here is my sample dataset:
library(caret)  # provides trainControl(), train() and extractProb()
# TEMP DATA
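# byrow = TRUE is assumed to be the intent, so that each pair below is one row
train_predictors <- matrix(data = c(1, 2,
                                    1, 3,
                                    2, 4,
                                    3, 5,
                                    4, 6,
                                    5, 4,
                                    6, 5,
                                    6, 6,
                                    7, 7,
                                    8, 8), nrow = 10, ncol = 2, byrow = TRUE)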
train_labels <- c(1,1,1,1,1,0,0,0,0,0)
test_predictors <- matrix(data = c(1,2), nrow = 1, ncol = 2)
# PREPROCESSING OF DATA
train_predictors <- as.data.frame(train_predictors)
test_predictors <- as.data.frame(test_predictors)
train_labels <- as.factor(train_labels)
This is how I train a simple random forest on train_predictors and train_labels:
# APPLY SIMPLE RANDOM FOREST ON TRAIN DATA
my_train_control <- trainControl(method = "cv",
                                 number = 2,
                                 savePredictions = TRUE,
                                 classProbs = TRUE)
rf_model <- train(x = train_predictors,
                  y = train_labels,
                  trControl = my_train_control,
                  tuneLength = 1)
You will get the following warning:
Warning message:
In train.default(x = train_predictors, y = train_labels, trControl = my_train_control, :
At least one of the class levels are not valid R variables names; This may cause errors if class probabilities are generated because the variables names will be converted to: X0, X1
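But that is only because 0 and 1 are used as the class labels (so when the columns of the prediction data frame are created, they are named X0 and X1 rather than 0 and 1), as explained by Max Kuhn (topepo).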
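To make the renaming in that warning concrete, here is a minimal base R sketch. It only shows how make.names() turns the levels "0" and "1" into X0 and X1, and how a factor could be relabelled with syntactically valid names. The label names "neg" and "pos" are hypothetical, chosen purely for illustration, and the sketch does not call caret at all.

```r
# Same labels as in the question, converted to a factor
labels <- c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0)
raw_levels <- levels(as.factor(labels))

# "0" and "1" are not valid R variable names, so make.names() prefixes an X
converted_levels <- make.names(raw_levels)
print(converted_levels)

# Relabelling with valid names ("neg"/"pos" are hypothetical) leaves them unchanged
valid_labels <- factor(labels, levels = c(0, 1), labels = c("neg", "pos"))
levels_unchanged <- identical(make.names(levels(valid_labels)), levels(valid_labels))
print(levels_unchanged)
```

With levels that are already valid names, there would be nothing for the warning to convert.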
I am able to extract the class prediction for the test data point as follows:
prediction_class_on_test_data <- predict(rf_model, test_predictors)
prediction_class_on_test_data <- as.numeric(as.character(prediction_class_on_test_data))
But when I try to predict the class probabilities for the test data point as follows:
prediction_prob_on_test_data <- predict(rf_model, test_predictors, type = "prob")
prediction_prob_on_test_data <- as.numeric(as.character(prediction_prob_on_test_data))
I get the following error:
Error in `[.data.frame`(out, , obsLevels, drop = FALSE) :
undefined columns selected
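I am sure there is a simple mistake somewhere, but what am I doing wrong?
Update
I was able to get the class probabilities and predictions for the test dataset using the extractProb function, as follows: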
dummy_test_labels <- rep(0, nrow(test_predictors))
predictions_on_complete_data <- extractProb(models = list(rf_model), testX = test_predictors, testY = dummy_test_labels)
predictions_on_test_data <- predictions_on_complete_data[predictions_on_complete_data$dataType == "Test", ]
But I am still not sure why predict() does not work with type = "prob".
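One possible explanation, sketched in base R only: the error message shows a data frame called out being indexed by obsLevels. If out carries the converted column names X0 and X1 while obsLevels still holds the original levels "0" and "1", that lookup would fail with exactly this message. This is an assumption about what happens inside predict(), not something I have confirmed in the caret source. The names prob_frame and obs_levels and the probability values below are made up for illustration.

```r
# Hypothetical stand-in for the internal probability table (values are made up)
prob_frame <- data.frame(X0 = 0.2, X1 = 0.8)

# Original factor levels, as they appear in train_labels
obs_levels <- c("0", "1")

# Selecting columns "0" and "1" fails because only X0 and X1 exist
lookup_result <- tryCatch(
  prob_frame[, obs_levels, drop = FALSE],
  error = function(e) conditionMessage(e)
)
print(lookup_result)

# Once column names and levels agree, the same lookup succeeds
names(prob_frame) <- obs_levels
matched <- prob_frame[, obs_levels, drop = FALSE]
print(matched)
```

If that is what is going on, recoding train_labels so that its levels are valid R names before calling train() may be worth trying.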