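I am trying to use the F1 score to determine which value of k gives the best model for its intended purpose. The model is built with the train function from the caret package.
Sample dataset: https://www.kaggle.com/lachster/churndata
My current code includes the following (a summary function for the F1 score):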
f1 <- function(data, lev = NULL, model = NULL) {
  precision <- posPredValue(data$pred, data$obs, positive = "pass")
  recall <- sensitivity(data$pred, data$obs, positive = "pass")
  f1_val <- (2*precision*recall) / (precision + recall)
  names(f1_val) <- c("F1")
  f1_val
}
The following is the train control:
train.control <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                              summaryFunction = f1, search = "grid")
And this is my final train call:
x <- train(CHURN ~ .,
           data = experiment,
           method = "knn",
           tuneGrid = expand.grid(.k = 1:30),
           metric = "F1",
           trControl = train.control)
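Note that the model is trying to predict churn for a set of telecom customers.
Running it returns the following:
Something is wrong; all the F1 metric values are missing: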
F1
Min. : NA
1st Qu.: NA
Median : NA
Mean :NaN
3rd Qu.: NA
Max. : NA
NA's :30
Error in train.default(x, y, weights = w, ...) : Stopping
In addition: Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
Any help would be epic!!
Edit: thanks to missuse's help, my code now looks like the following, but it returns the error shown after it:
levels(exp2$CHURN) <- make.names(levels(factor(exp2$CHURN)))
library(mlbench)
train.control <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
summaryFunction = prSummary, classProbs = TRUE)
knn_fit <- train(CHURN ~ ., data = exp2, method = "knn",
                 trControl = train.control,
                 preProcess = c("center", "scale"),
                 tuneLength = 15, metric = "F")
Error:
Error in trainControl(method = "repeatedcv", number = 10, repeats = 3, :
object 'prSummary' not found
Answer 0 (score: 2)
caret includes a summary function, prSummary, that reports the F1 score (as the column F). A complete example:
library(caret)
library(mlbench)
data(Sonar)
train.control <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
summaryFunction = prSummary, classProbs = TRUE)
knn_fit <- train(Class ~ ., data = Sonar, method = "knn",
                 trControl = train.control,
                 preProcess = c("center", "scale"),
                 tuneLength = 15,
                 metric = "F")
knn_fit
#output
k-Nearest Neighbors
208 samples
60 predictor
2 classes: 'M', 'R'
Pre-processing: centered (60), scaled (60)
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 187, 188, 187, 188, 187, 187, ...
Resampling results across tuning parameters:
  k   AUC        Precision  Recall     F
   5  0.3582687  0.7936713  0.9065657  0.8414592
   7  0.4985709  0.7758271  0.8883838  0.8239438
   9  0.6632328  0.7484092  0.8853535  0.8089210
  11  0.7426320  0.7151175  0.8676768  0.7814297
  13  0.7388742  0.6883105  0.8646465  0.7641392
  15  0.7594436  0.6787983  0.8467172  0.7520524
  17  0.7583071  0.6909693  0.8527778  0.7616448
  19  0.7702208  0.6913001  0.8585859  0.7644433
  21  0.7642698  0.6962528  0.8707071  0.7719442
  23  0.7652370  0.6945755  0.8707071  0.7696863
  25  0.7606508  0.6929364  0.8707071  0.7683987
  27  0.7454728  0.6916762  0.8676768  0.7669464
  29  0.7551679  0.6900416  0.8707071  0.7676640
  31  0.7603099  0.6935720  0.8828283  0.7749490
  33  0.7614621  0.6938805  0.8770202  0.7728923
F was used to select the optimal model using the largest value.
The final value used for the model was k = 5.
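If a hand-written summary function is preferred over prSummary, the sketch below may help. It rests on an assumption that the post does not confirm: that the NA values in the question come from positive = "pass" not matching any level of CHURN. The name f1_summary and the toy data are hypothetical, for illustration only; the positive class is taken from lev rather than hard-coded.

```r
library(caret)

# Hypothetical helper (not from the original post): an F1 summary function
# that reads the positive class from `lev` instead of a hard-coded label.
f1_summary <- function(data, lev = NULL, model = NULL) {
  # Assumption: the first factor level is the class of interest.
  positive <- lev[1]
  precision <- posPredValue(data$pred, data$obs, positive = positive)
  recall <- sensitivity(data$pred, data$obs, positive = positive)
  f1_val <- 2 * precision * recall / (precision + recall)
  # train() expects a named numeric vector; the name is what `metric` refers to.
  c(F1 = unname(f1_val))
}

# Toy check on made-up labels: 2 true positives, 1 false positive, 1 false negative.
toy <- data.frame(
  obs  = factor(c("yes", "yes", "yes", "no", "no", "no"), levels = c("yes", "no")),
  pred = factor(c("yes", "yes", "no", "yes", "no", "no"), levels = c("yes", "no"))
)
f1_summary(toy, lev = levels(toy$obs))
```

Such a function would be passed as summaryFunction = f1_summary in trainControl, with metric = "F1" in train, as in the question. If the class of interest is not the first level of the outcome factor, the factor would need to be releveled or the positive line adjusted.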