Question

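I have a training data set, let's call it "training_data", made up of 19 variables (features) and 1 label, for 20 variables (columns) in total. It contains only the best predictors: low-variance columns and poor predictors have already been removed, so this is the data frame that came out of feature selection. Let's call the label in this data set "final_score".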

I also have a test data set, let's call it "predictions_data", which has the same 19 variables (features) but no label variable, so 19 variables (columns) in total.

I am building a very simple regression model, using "lasso" regression through Caret's "train" method to train the model and then predict "final_score".

My code is as follows:

    library(caret)

    # Import training data as a data frame:
    training_data <- data.frame(training_data)

    # Set cross validation folds and times:
    fitControl <- trainControl(method = "repeatedcv",
                               number = 3,   # number of folds
                               repeats = 3)  # repeated three times

    # Train the model using "lasso" regression from train method.
    # I've called the model "model.cv":
    model.cv <- train(final_score ~ .,
                      data = training_data,
                      method = "lasso",
                      trControl = fitControl,
                      preProcess = c('scale', 'center'))

So far everything goes well: the model reports the best cross-validated result along with the metrics obtained (RMSE, MAE, etc.).

Now I want to apply the model to "predictions_data", so that the model can "predict" the labels for predictions_data.

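The code I am using to try to do this is:

    # Import test data set to a data frame (with no label column):
    predictions_data <- data.frame(predictions_data)

    # Apply the model using predict function from Caret, and save
    # the result in an object called "predictions":
    predictions <- predict(model.cv, newdata = predictions_data)

Here comes the problem. Even though I pass newdata = predictions_data, the predictions object returns the predicted labels for the training data set rather than for the test data set... What am I doing wrong? (Of course this is a very basic model, but even so it should work for prediction, right?)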

Thanks!

Answer 1

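The test data set contained some badly formatted data (i.e. NAs in numeric columns), unlike the training data set, which had been cleaned/prepared for training. Once the test data had been cleaned/prepared in the same way, the predict function worked correctly.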
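
A quick way to spot this kind of problem before calling predict is to compare the test data frame against the training one. The following is only a minimal sketch in base R: the toy columns x1 and x2, their values, and the helper names (missing_cols, class_mismatch, na_counts, clean_predictions_data) are made up for illustration and stand in for the real 19 predictors.

```r
# Toy stand-ins for the data frames in the question (values are invented).
training_data <- data.frame(x1 = c(1.0, 2.0, 3.0, 4.0),
                            x2 = c(0.5, 0.1, 0.9, 0.3),
                            final_score = c(10, 20, 30, 40))
predictions_data <- data.frame(x1 = c(1.5, NA, 3.5),
                               x2 = c(0.2, 0.4, NA))

# Predictor columns the model expects (everything except the label).
predictors <- setdiff(names(training_data), "final_score")

# 1. Are any predictor columns missing from the test data?
missing_cols <- setdiff(predictors, names(predictions_data))

# 2. Do the column classes match between the two data frames?
shared <- intersect(predictors, names(predictions_data))
train_classes <- sapply(training_data[shared], function(col) class(col)[1])
test_classes  <- sapply(predictions_data[shared], function(col) class(col)[1])
class_mismatch <- shared[train_classes != test_classes]

# 3. How many NAs does each test column contain?
na_counts <- colSums(is.na(predictions_data[shared]))

# One simple cleaning choice (an assumption, not the only option):
# keep only the rows with no missing predictor values.
keep_rows <- complete.cases(predictions_data[shared])
clean_predictions_data <- predictions_data[keep_rows, , drop = FALSE]
```

With the cleaned frame in hand, the call from the question would become predict(model.cv, newdata = clean_predictions_data). Dropping incomplete rows is just one choice; imputing the missing values is another, depending on how the training data was prepared.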

Problem applying a model trained with Caret's train method to a test data set to predict labels in R
