Question

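I am trying to write a function that takes a binary response variable y and a single explanatory variable x, runs 10-fold cross-validation, and returns the proportion of the response variable y that is misclassified. I then want to run this function to find out which explanatory variables are best at predicting low birth weight.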

So far I have tried:

attach(birthwt)
folds <- cut(seq(1,nrow(birthwt)),breaks=10,labels=FALSE) # Create 10 equally sized folds
bwt<-glm(low~age+lwt+race+smoke+ptl+ht+ui+ftv,family=binomial)

But I really don't know where to go from here.

Thanks

Answer 1

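For each fold, you need to construct training and test data sets, fit the model to the training set, and then use the predict.glm function to make the actual predictions on the test data. The comments below should be sufficient explanation.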

library("MASS")   # for the dataset
library("dplyr")  # for the filter function
data(birthwt)
set.seed(1)

n.folds <- 10
folds <- cut(sample(seq_len(nrow(birthwt))),  breaks=n.folds, labels=FALSE) # Note!

all.confustion.tables <- list()
for (i in seq_len(n.folds)) {
  # Create training set
  train <- filter(birthwt, folds != i)  # Take all other samples than i

  # Create test set
  test <- filter(birthwt, folds == i)

  # Fit the glm model on the training set
  glm.train <- glm(low ~ age + lwt + race + smoke + ptl + ht + ui + ftv, 
                   family = binomial, data =  train)

  # Use the fitted model on the test set
  logit.prob <- predict(glm.train, newdata = test)

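  # Classify based on the predictions (predict() returns log-odds by default,
  # so a cutoff of 0 corresponds to a predicted probability of 0.5)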
  pred.class <- ifelse(logit.prob < 0, 0, 1)

  # Construct the confusion table
  all.confustion.tables[[i]] <- table(pred = pred.class, true = test$low)
}

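Note that I randomly choose which rows go into each fold. This is important: otherwise any ordering of the rows in the data carries over into the folds.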
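To illustrate why the shuffling matters, here is a small sketch on a toy outcome vector that is sorted by class. The vector y (130 zeros followed by 59 ones) and the names seq.folds and rand.folds are made up for this illustration; it is not the real data.

```r
# Toy outcome sorted by class (an assumption made for illustration only)
y <- rep(c(0, 1), times = c(130, 59))
n.folds <- 10

# Folds assigned in row order, as in the question
seq.folds <- cut(seq_along(y), breaks = n.folds, labels = FALSE)

# Folds assigned after shuffling the row indices, as in the answer
set.seed(1)
rand.folds <- cut(sample(seq_along(y)), breaks = n.folds, labels = FALSE)

# Share of y == 1 in each fold under both schemes
seq.share  <- tapply(y, seq.folds, mean)
rand.share <- tapply(y, rand.folds, mean)
print(round(rbind(sequential = seq.share, shuffled = rand.share), 2))
```

With the sorted toy vector, the first sequential folds contain only 0s and the last ones only 1s, so each test fold has a very different class mix from its training set. The shuffled folds each get a mix of both classes.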

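We can then compare the predictions against the true values (given in the confusion table), for example for the 5th cross-validation run: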

all.confustion.tables[[5]]
#    true
# pred 0 1
#    0 7 8
#    1 4 0

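So in this particular run, taking low = 1 as the positive class, we have 7 true negatives, 4 false positives, 8 false negatives and 0 true positives.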
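As a rough sketch of how the cells of such a table can be read off by name, the block below re-enters the table for the 5th run by hand and derives a few measures from it. It assumes low = 1 is treated as the positive class; the names tab, tn, fn, fp and tp are made up for this illustration.

```r
# Confusion table from the 5th run, typed in by hand
# (rows: predicted class, columns: true class)
tab <- matrix(c(7, 4, 8, 0), nrow = 2,
              dimnames = list(pred = c("0", "1"), true = c("0", "1")))

# Name the four cells, taking low = 1 as the positive class (an assumption)
tn <- tab["0", "0"]  # predicted 0, truly 0
fn <- tab["0", "1"]  # predicted 0, truly 1
fp <- tab["1", "0"]  # predicted 1, truly 0
tp <- tab["1", "1"]  # predicted 1, truly 1

misclass    <- (fp + fn) / sum(tab)  # share of wrong predictions
sensitivity <- tp / (tp + fn)        # share of true 1s that were found
specificity <- tn / (tn + fp)        # share of true 0s that were found
print(c(misclass = misclass, sensitivity = sensitivity,
        specificity = specificity))
```

Indexing by name rather than by position keeps the calculation readable and makes it explicit which class counts as positive.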

We can now compute any performance measure we like from the confusion tables.

# Compute the average misclassification risk.
misclassrisk <- function(x) { (sum(x) - sum(diag(x)))/sum(x) }
risk <- sapply(all.confustion.tables, misclassrisk)
mean(risk)
#[1] 0.3119883

So in this case our misclassification risk is about 31%. I will leave it to you to wrap this into a function.
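One way the wrapping might look is sketched below, for the single-predictor case described in the question. This is an illustration under stated assumptions, not the answerer's own code: the name cv.misclass and its arguments are made up, y is assumed to be coded 0/1, fold labels are drawn with sample(rep(...)) instead of cut(sample(...)), and the error is computed as mean(pred != y) rather than from a table. The last choice sidesteps the case where a fold predicts only one class, so that table() returns a single row and diag() may pick the wrong cell.

```r
# Sketch: k-fold cross-validated misclassification rate for one predictor.
# y: binary response coded 0/1; x: a single explanatory variable.
cv.misclass <- function(y, x, n.folds = 10) {
  dat <- data.frame(y = y, x = x)
  # Random, roughly equal-sized fold labels
  folds <- sample(rep(seq_len(n.folds), length.out = nrow(dat)))

  fold.error <- sapply(seq_len(n.folds), function(i) {
    train <- dat[folds != i, , drop = FALSE]
    test  <- dat[folds == i, , drop = FALSE]
    fit   <- glm(y ~ x, family = binomial, data = train)
    # type = "response" returns probabilities; 0.5 matches a logit cutoff of 0
    prob  <- predict(fit, newdata = test, type = "response")
    pred  <- ifelse(prob < 0.5, 0, 1)
    mean(pred != test$y)  # proportion misclassified in this fold
  })
  mean(fold.error)  # average over the folds
}

# Usage with the data from the answer, if MASS is installed
if (requireNamespace("MASS", quietly = TRUE)) {
  data(birthwt, package = "MASS")
  set.seed(1)
  vars <- c("age", "lwt", "race", "smoke", "ptl", "ht", "ui", "ftv")
  print(sort(sapply(vars, function(v) cv.misclass(birthwt$low, birthwt[[v]]))))
}
```

Each call draws its own random folds, so small differences between variables may just reflect the split. Repeating the comparison with several seeds gives a better idea of how stable the ranking is.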

k-fold cross-validation to find the misclassification rate
