Subscript out of bounds when implementing cross-validation in R

Asked: 2017-10-31 05:08:13

Tags: r cross-validation r-caret

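I get the error "Error in `[.default`(cm, 2, 2) : subscript out of bounds" when implementing cross-validation for xgboost. My dataset has the following structure: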

'data.frame':   889 obs. of  7 variables:
 $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
 $ Pclass  : int  3 1 3 1 3 3 1 3 3 2 ...
 $ Sex     : num  1 2 2 2 1 1 1 1 2 2 ...
 $ SibSp   : int  1 1 0 1 0 0 0 3 0 1 ...
 $ Parch   : int  0 0 0 0 0 0 0 1 2 0 ...
 $ Fare    : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Embarked: num  3 1 3 3 3 2 3 3 3 1 ...
 - attr(*, "na.action")=Class 'omit'  Named int [1:2] 62 830
  .. ..- attr(*, "names")= chr [1:2] "62" "830"

A summary of my dataset:

 Survived     Pclass           Sex            SibSp            Parch       
 0:549    Min.   :1.000   Min.   :1.000   Min.   :0.0000   Min.   :0.0000  
 1:340    1st Qu.:2.000   1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:0.0000  
          Median :3.000   Median :1.000   Median :0.0000   Median :0.0000  
          Mean   :2.312   Mean   :1.351   Mean   :0.5242   Mean   :0.3825  
          3rd Qu.:3.000   3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.:0.0000  
          Max.   :3.000   Max.   :2.000   Max.   :8.0000   Max.   :6.0000  
      Fare            Embarked    
 Min.   :  0.000   Min.   :1.000  
 1st Qu.:  7.896   1st Qu.:2.000  
 Median : 14.454   Median :3.000  
 Mean   : 32.097   Mean   :2.535  
 3rd Qu.: 31.000   3rd Qu.:3.000  
 Max.   :512.329   Max.   :3.000

The error is thrown when I run the following code:

library(caret)    # createFolds()
library(xgboost)  # xgboost()
folds = createFolds(traindataset$Survived, k = 10)
cv = lapply(folds, function(x) {
  training_fold = traindataset[-x, ]
  test_fold = traindataset[x, ]
  # Fit on the training fold only, so the held-out fold is not seen during training
  classifier = xgboost(data = as.matrix(training_fold[-1]), label = training_fold$Survived, nrounds = 10)
  y_pred = predict(classifier, newdata = as.matrix(test_fold[-1]))
  y_pred = (y_pred >= 0.5)
  cm = table(test_fold[, 1], y_pred)
  accuracy = (cm[1,1] + cm[2,2]) / (cm[1,1] + cm[2,2] + cm[1,2] + cm[2,1])
  return(accuracy)
})

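Note that I converted Survived from an integer with values 0 and 1 to a factor for classification purposes. To my surprise, the code works when Survived is an integer, but I get this error when it is a factor.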
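The difference between the two encodings can be checked in base R. The following is a minimal sketch, assuming a small hypothetical vector `survived_int` in place of the real Survived column; it only shows how a factor built from 0/1 values is coded internally.

```r
# Hypothetical stand-in for the Survived column (not the real data).
survived_int <- c(0L, 1L, 1L, 0L, 1L)
survived_fac <- factor(survived_int)  # levels are "0" and "1"

# as.numeric() on a factor returns the internal level codes, not the labels.
codes_from_factor <- as.numeric(survived_fac)                 # 1 2 2 1 2

# Going through as.character() recovers the original 0/1 values.
values_from_factor <- as.numeric(as.character(survived_fac))  # 0 1 1 0 1
```

Anything that coerces the factor with `as.numeric()` therefore sees labels 1 and 2 rather than 0 and 1.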

Any help is appreciated. Thanks.

1 Answer:

Answer 0: (score: 1)

I found the solution to my problem. Apologies for the inconvenience.

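The problem was caused by converting the target variable to a factor. I assume xgboost expects a numeric label rather than a factor, and that this is what triggered the error.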
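One plausible mechanism, stated here as an assumption rather than confirmed xgboost behaviour: if the factor label is coerced to its level codes 1 and 2, every prediction ends up at or above 0.5, `y_pred` is all TRUE, and `table()` returns a table with a single column, so `cm[2, 2]` does not exist. The sketch below reproduces that situation in base R with made-up values (`actual`, `pred_prob`) and shows a hypothetical helper, `fold_accuracy`, that fixes the factor levels so the table is always 2 x 2.

```r
# Hypothetical fold: true labels as a factor, and predictions that all
# exceed the 0.5 threshold (as would happen if the model saw labels 1/2).
actual <- factor(c("0", "1", "1", "0"), levels = c("0", "1"))
pred_prob <- c(1.1, 1.9, 1.8, 1.2)
y_pred <- pred_prob >= 0.5          # all TRUE

cm_bad <- table(actual, y_pred)     # only one column, named "TRUE"
dim(cm_bad)
# Indexing the missing second column raises the error from the question.
failed <- inherits(try(cm_bad[2, 2], silent = TRUE), "try-error")

# Fixing the levels of both factors guarantees a 2 x 2 table,
# even when one class never appears among the predictions.
fold_accuracy <- function(actual, pred_prob, threshold = 0.5) {
  predicted <- factor(pred_prob >= threshold, levels = c(FALSE, TRUE))
  actual <- factor(actual, levels = c("0", "1"))
  cm <- table(actual, predicted)
  sum(diag(cm)) / sum(cm)           # correct predictions / all predictions
}

fold_accuracy(actual, pred_prob)
```

In the question's loop, the corresponding change would be to pass a numeric 0/1 label, for example `as.numeric(as.character(training_fold$Survived))`, and to compute the accuracy from a table with fixed levels as above.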