Simple question (I think) - using the F1 score metric in KNN via the caret package

Asked: 2018-03-24 15:38:44

Tags: r r-caret

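I am trying to use the F1 score to determine which value of k gives the best model for its intended purpose. The model is created with the train function from the caret package.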

Sample dataset: https://www.kaggle.com/lachster/churndata

My current code includes the following (a summary function for the F1 score):

f1 <- function(data, lev = NULL, model = NULL) {
    precision <- posPredValue(data$pred, data$obs, positive = "pass")
    recall <- sensitivity(data$pred, data$obs, positive = "pass")
    f1_val <- (2*precision*recall) / (precision + recall)
    names(f1_val) <- c("F1")
    f1_val
}

The following as the train control:

train.control <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                              summaryFunction = f1, search = "grid")

And this is the final train call that I run:

x <- train(CHURN ~. , 
  data = experiment, 
  method = "knn", 
  tuneGrid = expand.grid(.k=1:30), 
  metric = "F1", 
  trControl = train.control)

Note that the model is trying to predict churn for a set of telecom customers.

Running it returns the following:

Something is wrong; all the F1 metric values are missing:

       F1     
 Min.   : NA  
 1st Qu.: NA  
 Median : NA  
 Mean   :NaN  
 3rd Qu.: NA  
 Max.   : NA  
 NA's   :30   
Error in train.default(x, y, weights = w, ...) : Stopping
In addition: Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :
  There were missing values in resampled performance measures.

Any help would be epic!!

Edit: thanks to missuse's help, my code now looks like the following, but it returns this error:

    levels(exp2$CHURN) <- make.names(levels(factor(exp2$CHURN)))

    library(mlbench)

    train.control <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                                  summaryFunction = prSummary, classProbs = TRUE)

    knn_fit <- train(CHURN ~., data = exp2, method = "knn",
                     trControl = train.control,
                     preProcess = c("center", "scale"),
                     tuneLength = 15, metric = "F")

Error:

Error in trainControl(method = "repeatedcv", number = 10, repeats = 3,  : 
  object 'prSummary' not found

1 answer:

Answer 0 (score: 2)

caret includes a summary function, prSummary, that provides the F1 score. A complete example:

library(caret)
library(mlbench)
data(Sonar)

train.control <- trainControl(method = "repeatedcv", number = 10, repeats = 3, 
                              summaryFunction = prSummary, classProbs = TRUE)


knn_fit <- train(Class ~., data = Sonar, method = "knn",
                 trControl=train.control ,
                 preProcess = c("center", "scale"),
                 tuneLength = 15,
                 metric = "F")
knn_fit
#output
k-Nearest Neighbors 

208 samples
 60 predictor
  2 classes: 'M', 'R' 

Pre-processing: centered (60), scaled (60) 
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 187, 188, 187, 188, 187, 187, ... 
Resampling results across tuning parameters:

  k   AUC        Precision  Recall     F        
   5  0.3582687  0.7936713  0.9065657  0.8414592
   7  0.4985709  0.7758271  0.8883838  0.8239438
   9  0.6632328  0.7484092  0.8853535  0.8089210
  11  0.7426320  0.7151175  0.8676768  0.7814297
  13  0.7388742  0.6883105  0.8646465  0.7641392
  15  0.7594436  0.6787983  0.8467172  0.7520524
  17  0.7583071  0.6909693  0.8527778  0.7616448
  19  0.7702208  0.6913001  0.8585859  0.7644433
  21  0.7642698  0.6962528  0.8707071  0.7719442
  23  0.7652370  0.6945755  0.8707071  0.7696863
  25  0.7606508  0.6929364  0.8707071  0.7683987
  27  0.7454728  0.6916762  0.8676768  0.7669464
  29  0.7551679  0.6900416  0.8707071  0.7676640
  31  0.7603099  0.6935720  0.8828283  0.7749490
  33  0.7614621  0.6938805  0.8770202  0.7728923

F was used to select the optimal model using the largest value.
The final value used for the model was k = 5.
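
For comparison with the custom f1 function in the question, here is a minimal sketch of a summary function that takes the positive class from lev instead of hard-coding positive = "pass". It rests on an assumption the question does not confirm: that "pass" is not a level of CHURN, which would leave recall undefined and could explain the all-NA F1 values. The name f1_summary and the toy data are hypothetical, and precision and recall are computed in base R here rather than with caret's posPredValue and sensitivity.

```r
# Hypothetical summary function with the (data, lev, model) signature
# that caret expects for summaryFunction.
# Assumption: the first factor level, lev[1], is the positive class.
f1_summary <- function(data, lev = NULL, model = NULL) {
  positive <- lev[1]
  tp <- sum(data$pred == positive & data$obs == positive)  # true positives
  fp <- sum(data$pred == positive & data$obs != positive)  # false positives
  fn <- sum(data$pred != positive & data$obs == positive)  # false negatives
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  # Return a named numeric vector; the name is what `metric` refers to
  c(F1 = 2 * precision * recall / (precision + recall))
}

# Toy check on made-up data (not the churn dataset)
toy <- data.frame(
  obs  = factor(c("yes", "yes", "yes", "no", "no"), levels = c("yes", "no")),
  pred = factor(c("yes", "yes", "no", "yes", "no"), levels = c("yes", "no"))
)
result <- f1_summary(toy, lev = levels(toy$obs))
print(result)  # precision = recall = 2/3, so F1 = 2/3
```

A function like this could be passed as summaryFunction = f1_summary in trainControl, with metric = "F1" in train. If a fold contains no predicted or no observed positives, the result is still NaN, so prSummary may remain the simpler choice.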