Question

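I used glmnet to build a predictive model on a training set with about 200 predictors and 100 samples, for a binomial regression/classification problem.

I selected the best model (16 predictors), the one that gave me the largest AUC. I have an independent test set that contains only those variables (the 16 predictors) that made it into the final model from the training set.

Is there a way to use predict.glmnet with the best model from the training set when the new test set contains data only for the variables that made it into the final model?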

Answer 1

glmnet requires the validation/test set to contain exactly the same number/names of variables as the training dataset. For example:

library(caret)
library(glmnet)
df <- ... # a dataframe with 200 variables, some of which you want to predict on 
      #  & some of which you don't care about.
      # Variable 13 ('Response.Variable') is the dependent variable.
      # Variables 1-12 & 14-113 are the predictor variables
      # All training/testing & validation datasets are derived from this single df.

# Split dataframe into training & testing sets
inTrain <- createDataPartition(df$Response.Variable, p = .75, list = FALSE)
Train <- df[ inTrain, ] # Training dataset for all model development
Test <- df[ -inTrain, ] # Final sample for model validation

# Run logistic regression, using only specified predictor variables
logCV <- cv.glmnet(x = data.matrix(Train[, c(1:12, 14:113)]),
                   y = Train[, 13],
                   family = 'binomial', type.measure = 'auc')

# Test model over final test set, using specified predictor variables
# Create field in dataset that contains predicted values
Test$prob <- predict(logCV, type = "response",
                     newx = data.matrix(Test[, c(1:12, 14:113)]),
                     s = 'lambda.min')

For a completely new set of data, you can constrain the new df to the necessary variables with some variant of the following:

new.df <- ... # new df w/ 1,000 variables, which include all predictor variables used 
              # in developing the model

# Create object with requisite predictor variable names that we specified in the model
predictvars <- c('PredictorVar1', 'PredictorVar2', 'PredictorVar3', 
                  ... 'PredictorVarK')
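# Index by name so the columns come out in the order given in predictvars;
# predict() multiplies by position, so that order must match the columns
# of the matrix the model was trained on.
new.df$prob <- predict(logCV, type = "response",
                       newx = data.matrix(new.df[, predictvars]),
                       s = 'lambda.min')
# The above limits the new df of 1,000 variables to whatever variable
# names (or indices) went into the model.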

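Also, glmnet only works with matrices. That is probably why you are getting the error you posted in the comments on the question. Some users (myself included) have found that as.matrix() does not solve the problem; data.matrix() seems to work (hence its use in the code above). This issue is addressed in a thread or two on SO.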
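A likely reason for the as.matrix() versus data.matrix() difference is how each handles non-numeric columns. The toy example below (the data frame `demo` is made up for illustration) shows that as.matrix() falls back to a character matrix as soon as one column is a factor, while data.matrix() returns a numeric matrix by replacing factors with their integer codes:

```r
# Hypothetical two-column data frame: one numeric column, one factor
demo <- data.frame(a = c(1.5, 2.5), g = factor(c("lo", "hi")))

m_chr <- as.matrix(demo)    # character matrix: every cell becomes a string
m_num <- data.matrix(demo)  # numeric matrix: factor replaced by integer codes

is.character(m_chr)  # TRUE
is.numeric(m_num)    # TRUE
m_num[, "g"]         # codes follow the alphabetical level order ("hi" = 1, "lo" = 2)
```

Note that integer codes impose an ordering on the factor levels. If that is not what you want, building dummy variables with model.matrix() before calling glmnet is the usual alternative.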

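I assume that all variables in the new dataset to be predicted also need to be in the same format as in the dataset used for model development. I usually pull all my data from the same source, so I have not run into what glmnet does when the formats differ.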
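For the situation in the question, where the test set holds only the selected predictors, one possible workaround is to pad the test matrix back out to the full training width. Predictors that glmnet dropped have a coefficient of exactly zero, so whatever value sits in those columns does not affect the prediction. The sketch below uses simulated data, and `x`, `y`, `cvfit`, and `test_small` are hypothetical names; it assumes the "best model" is the fit at `lambda.min`.

```r
library(glmnet)

set.seed(1)
n <- 100; p <- 200
# Simulated training data with named columns (stand-in for the real training set)
x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("V", seq_len(p))))
y <- rbinom(n, 1, plogis(2 * x[, 1] - 2 * x[, 2]))

# nfolds = 5 keeps enough observations per fold for the AUC measure
cvfit <- cv.glmnet(x, y, family = "binomial", type.measure = "auc", nfolds = 5)

# Named coefficient vector at lambda.min; the first entry is the intercept
b <- as.matrix(coef(cvfit, s = "lambda.min"))[, 1]
selected <- setdiff(names(b)[b != 0], "(Intercept)")

# Hypothetical test set that contains only the selected predictors
test_small <- matrix(rnorm(20 * length(selected)), 20, length(selected),
                     dimnames = list(NULL, selected))

# Zero-filled matrix with the training columns in the training order,
# then copy the available columns in by name
newx <- matrix(0, nrow(test_small), ncol(x), dimnames = list(NULL, colnames(x)))
newx[, selected] <- test_small[, selected, drop = FALSE]

prob <- predict(cvfit, newx = newx, s = "lambda.min", type = "response")
```

The same probabilities can be computed by hand as `plogis(b["(Intercept)"] + test_small %*% b[selected])`, which is a handy cross-check that the padding did what was intended.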
