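If I understand ROC correctly, an AUC of 0.5 means a useless model with no predictive power. I fit a logistic regression on the same data and got a ROC of 0.64, so I believe the data has at least some predictive power.
I am wondering whether my configuration is wrong somewhere: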
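As a quick check on the "0.5 means no predictive power" reading, here is a minimal base R sketch. The helper `auc_rank` is a hypothetical name, not part of caret; it computes the ROC AUC from ranks (the Mann-Whitney form), and the class counts are the ones from `summary(x)` further down. A model that gives every row the same score lands at exactly 0.5:

```r
# AUC as the probability that a random positive outranks a random negative
# (ties count as one half), computed from the Mann-Whitney rank statistic.
# auc_rank is a hypothetical helper written for this illustration.
auc_rank <- function(score, is_positive) {
  r <- rank(score)                       # ties receive their average rank
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Class counts from the post: 6446 rows of X0 and 116 rows of X1.
is_positive <- c(rep(FALSE, 6446), rep(TRUE, 116))

# Same score for every row: nothing separates the classes.
constant_score <- rep(0.0177, length(is_positive))
auc_constant <- auc_rank(constant_score, is_positive)

# Scores that separate the classes perfectly, for comparison.
perfect_score <- as.numeric(is_positive)
auc_perfect <- auc_rank(perfect_score, is_positive)

print(c(constant = auc_constant, perfect = auc_perfect))
```

So a ROC of exactly 0.5 in every resample is what one would see if a model returned one identical probability for all rows, which is a stronger statement than "the model is weak".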
## tuning & parameters
library(caret)  # trainControl(), train()
library(dplyr)  # %>% and select()

set.seed(123)
train_control <- trainControl(
  method = "cv",
  number = 5,
  savePredictions = TRUE,
  verboseIter = TRUE,
  classProbs = TRUE,
  summaryFunction = my_summary  # my custom summary function (definition not shown)
)
linear_model = train(
  x = training_data %>% select(-Avg_Load_Time),
  y = target,
  trControl = train_control,
  method = "glm", # logistic regression
  family = "binomial",
  metric = "ROC"
)
This gives a ROC of 0.64.
Then I tried a tree:
tree_model = train(
  x = training_data %>% select(-Avg_Load_Time),
  y = target,
  trControl = train_control,
  method = "rpart", # decision tree
  metric = "ROC",
  tuneLength = 20
)
This gives a ROC of 0.5.
Here are all the evaluation metrics for the two models:
> summary(results)
Call:
summary.resamples(object = results)
Models: logit, tree
Number of resamples: 100
Accuracy
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
logit 0.9817212 0.9824695 0.9824695 0.9823225 0.9824695 0.9824829 0
tree 0.9817352 0.9824695 0.9824695 0.9823226 0.9824695 0.9824695 0
AUC
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
logit 0.9867658 0.9888867 0.990663 0.9896725 0.9907191 0.991328 0
tree 0.0000000 0.0000000 0.000000 0.0000000 0.0000000 0.000000 0
F
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
logit 0.9907763 0.9911572 0.9911572 0.9910824 0.9911572 0.9911640 0
tree 0.9907834 0.9911572 0.9911572 0.9910825 0.9911572 0.9911572 0
Kappa
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
logit 0 0 0 0 0 0 0
tree 0 0 0 0 0 0 0
Precision
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
logit 0.9817212 0.9824695 0.9824695 0.9823225 0.9824695 0.9824829 0
tree 0.9817352 0.9824695 0.9824695 0.9823226 0.9824695 0.9824695 0
Recall
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
logit 1 1 1 1 1 1 0
tree 1 1 1 1 1 1 0
ROC
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
logit 0.5741854 0.6315647 0.6589653 0.6448492 0.6685837 0.6909468 0
tree 0.5000000 0.5000000 0.5000000 0.5000000 0.5000000 0.5000000 0
Sens
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
logit 1 1 1 1 1 1 0
tree 1 1 1 1 1 1 0
Spec
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
logit 0 0 0 0 0 0 0
tree 0 0 0 0 0 0 0
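Here AUC is the precision-recall AUC (prAUC) and ROC is the ROC AUC. The numbers for the logistic regression are what I expected, but something looks wrong with the tree, because its ROC is exactly 0.5 in every resample. Is there a flaw in my train() configuration?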
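For reference, the threshold-based metrics above can be reproduced by hand. The sketch below works through the arithmetic for a classifier that labels every row X0, using the class counts from `summary(x)` below. It assumes that X0, the first factor level, is the class treated as the event for Sens, Recall and Precision, which is my understanding of caret's default and is consistent with Sens = 1 and Spec = 0 in the output:

```r
# Class counts taken from summary(x) in the post.
n_x0 <- 6446
n_x1 <- 116
n <- n_x0 + n_x1

# Confusion counts if every row is predicted as X0.
# Assumption: X0 (first factor level) is the "event" class.
true_x0_pred_x0 <- n_x0
true_x1_pred_x0 <- n_x1
true_x0_pred_x1 <- 0
true_x1_pred_x1 <- 0

accuracy  <- (true_x0_pred_x0 + true_x1_pred_x1) / n
sens      <- true_x0_pred_x0 / (true_x0_pred_x0 + true_x0_pred_x1)
spec      <- true_x1_pred_x1 / (true_x1_pred_x1 + true_x1_pred_x0)
precision <- true_x0_pred_x0 / (true_x0_pred_x0 + true_x1_pred_x0)
f1        <- 2 * precision * sens / (precision + sens)

# Cohen's kappa: observed agreement versus agreement expected by chance.
p_pred_x0 <- (true_x0_pred_x0 + true_x1_pred_x0) / n
p_true_x0 <- n_x0 / n
expected  <- p_pred_x0 * p_true_x0 + (1 - p_pred_x0) * (1 - p_true_x0)
kappa     <- (accuracy - expected) / (1 - expected)

print(round(c(Accuracy = accuracy, Sens = sens, Spec = spec,
              Precision = precision, F = f1, Kappa = kappa), 7))
```

The resulting accuracy (about 0.9823), F (about 0.9911) and kappa of 0 agree with the means reported for both models, so at the default cutoff both appear to predict X0 for every row. The logit still reaches a ROC of 0.64 because its predicted probabilities vary between rows; a tree ROC of exactly 0.5 would be consistent with a tree that made no splits and returns the same probability everywhere, though the output alone does not confirm that.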
Some more details about my data:
x is the training data frame with the target joined on:
summary(x)
userTypeNewVisitor deviceCategorydesktop Traffic_TypePaidTraffic Log_Avg_Load_Time Avg_Load_Time target
Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :-1.4271 Min. : 0.24 X0:6446
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 0.8416 1st Qu.: 2.32 X1: 116
Median :0.0000 Median :0.0000 Median :0.0000 Median : 1.4516 Median : 4.27
Mean :0.3478 Mean :0.2138 Mean :0.4139 Mean : 1.5607 Mean : 10.18
3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.: 2.1668 3rd Qu.: 8.73
Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. : 6.1834 Max. :484.62
Here is a dput of the rows where the target class is positive (X1):
> dput(glimpse(x %>% filter(target == "X1")))
Observations: 116
Variables: 6
$ userTypeNewVisitor <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, …
$ deviceCategorydesktop <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ Traffic_TypePaidTraffic <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, …
$ Log_Avg_Load_Time <dbl> 0.58221562, 0.97077892, 0.98954119, 1.80500470, 1.37371558, 2.38508631, 2.47232787, 2.00417906, 1.43270073, 1.19694819, 0.44468582, 1.68824909, 1.34025042, 1.06815308, 1.28923265, 1.6…
$ Avg_Load_Time <dbl> 1.79, 2.64, 2.69, 6.08, 3.95, 10.86, 11.85, 7.42, 4.19, 3.31, 1.56, 5.41, 3.82, 2.91, 3.63, 4.99, 1.29, 4.60, 8.98, 2.59, 3.01, 5.18, 4.73, 3.75, 3.40, 5.46, 4.65, 3.10, 5.78, 5.81, 1…
$ target <fct> X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1,…
structure(list(userTypeNewVisitor = c(1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 1, 0), deviceCategorydesktop = c(1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0), Traffic_TypePaidTraffic = c(0, 0, 0, 0, 0, 0, 1, 0, 0,
0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0,
0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
0, 0), Log_Avg_Load_Time = c(0.582215619852664, 0.970778917158225,
0.989541193613748, 1.80500469597808, 1.37371557891303, 2.38508631450579,
2.47232786758114, 2.00417905717929, 1.43270073393405, 1.19694818938897,
0.444685821261446, 1.68824909285839, 1.34025042261848, 1.0681530811834,
1.28923264827676, 1.60743590976343, 0.254642218373581, 1.52605630349505,
2.19499988231411, 0.951657875711446, 1.10194007876078, 1.64480505627139,
1.55392520250384, 1.32175583998232, 1.22377543162212, 1.69744878975681,
1.53686721959926, 1.1314021114911, 1.75440368268429, 1.75958057086382,
2.46640317822344, 1.12167756159911, 1.41827740697294, -0.0202027073175195,
1.25276296849537, 1.43508452528932, 2.75110969056266, 0.741937344729377,
0.405465108108164, 0.78845736036427, 1.45161382724053, 2.00552585872967,
1.47704872438835, 2.797890905102, 1.11841491596429, 0.86288995514704,
1.9473377010465, 0.662687973075237, 0.392042087776024, 1.0952733874026,
0.978326122793608, 1.66770682055808, 1.52822785700856, 1.34807314829969,
1.51512723296286, 1.3609765531356, 0.85015092936961, 1.41098697371026,
0.824175442966349, 0.854415328156068, 1.20896034583698, 0.524728528934982,
1.07840958135059, -0.2484613592985, 0.641853886172395, 1.68824909285839,
1.29198368164865, 0.751416088683921, 1.16627093714192, 1.83098018238134,
1.45161382724053, 1.5953389880546, 0.802001585472027, 1.58719230348678,
1.34025042261848, 1.25561603747777, 1.56024766824333, 0.828551817566148,
0.582215619852664, 2.23964529322017, 0.871293365943419, 1.87793716546911,
1.10856261952128, 1.69193913394584, 1.880990602956, 1.35066718347674,
0.774727167552368, 1.36609165380237, 2.10169215061466, 1.24126858906963,
0.904218150639886, 1.26412672714568, 1.67896397508271, 0.350656871613169,
0.431782416425538, 1.54115907168081, 1.45161382724053, 1.34286480319255,
1.25276296849537, 1.4747630091075, 1.51072193949494, 1.10194007876078,
0.908258560176891, 2.36273901581379, 1.42791603581071, 2.10778601468898,
0.615185639090233, 1.24703229378638, 0.810930216216329, 1.19392246847243,
1.37371557891303, 1.56653041142282, 1.07840958135059, 0.27002713721306,
1.55180879959746, 0.797507195884188), Avg_Load_Time = c(1.79,
2.64, 2.69, 6.08, 3.95, 10.86, 11.85, 7.42, 4.19, 3.31, 1.56,
5.41, 3.82, 2.91, 3.63, 4.99, 1.29, 4.6, 8.98, 2.59, 3.01, 5.18,
4.73, 3.75, 3.4, 5.46, 4.65, 3.1, 5.78, 5.81, 11.78, 3.07, 4.13,
0.98, 3.5, 4.2, 15.66, 2.1, 1.5, 2.2, 4.27, 7.43, 4.38, 16.41,
3.06, 2.37, 7.01, 1.94, 1.48, 2.99, 2.66, 5.3, 4.61, 3.85, 4.55,
3.9, 2.34, 4.1, 2.28, 2.35, 3.35, 1.69, 2.94, 0.78, 1.9, 5.41,
3.64, 2.12, 3.21, 6.24, 4.27, 4.93, 2.23, 4.89, 3.82, 3.51, 4.76,
2.29, 1.79, 9.39, 2.39, 6.54, 3.03, 5.43, 6.56, 3.86, 2.17, 3.92,
8.18, 3.46, 2.47, 3.54, 5.36, 1.42, 1.54, 4.67, 4.27, 3.83, 3.5,
4.37, 4.53, 3.01, 2.48, 10.62, 4.17, 8.23, 1.85, 3.48, 2.25,
3.3, 3.95, 4.79, 2.94, 1.31, 4.72, 2.22), target = structure(c(2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L), .Label = c("X0", "X1"), class = "factor")), row.names = c(NA,
-116L), class = "data.frame")
I then added the argument sampling = "up" to trainControl. These are the results afterwards:
> summary(results)
Call:
summary.resamples(object = results)
Models: logit, tree
Number of resamples: 100
Accuracy
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
logit 0.5907012 0.6009139 0.6051829 0.6071289 0.6051829 0.6336634 0
tree 0.9207317 0.9329779 0.9375476 0.9352332 0.9420732 0.9428354 0
AUC
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
logit 0.98673234 0.98959375 0.98968944 0.98978234 0.99070965 0.99218652 0
tree 0.04041049 0.04273307 0.04570514 0.04832824 0.05028082 0.06251166 0
F
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
logit 0.7399516 0.7473481 0.7502411 0.7522966 0.7507218 0.7732202 0
tree 0.9586974 0.9652723 0.9677419 0.9664735 0.9701023 0.9705536 0
Kappa
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
logit 0.005937197 0.013162973 0.017745938 0.01612017 0.01831303 0.02544174 0
tree -0.008827835 -0.001674637 0.001420743 0.01140350 0.01691454 0.04918470 0
Precision
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
logit 0.9845361 0.9855769 0.9885204 0.9876619 0.9885932 0.9910828 0
tree 0.9820993 0.9823293 0.9824281 0.9826814 0.9825119 0.9840383 0
Recall
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
logit 0.5927075 0.6007752 0.6035687 0.6076647 0.6051202 0.6361521 0
tree 0.9363848 0.9487975 0.9534884 0.9508218 0.9565555 0.9588829 0
ROC
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
logit 0.5772759 0.6391583 0.6480283 0.6462557 0.6620063 0.7048099 0
tree 0.4912470 0.4990226 0.5019395 0.5105331 0.5160169 0.5444396 0
Sens
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
logit 0.5927075 0.6007752 0.6035687 0.6076647 0.6051202 0.6361521 0
tree 0.9363848 0.9487975 0.9534884 0.9508218 0.9565555 0.9588829 0
Spec
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
logit 0.47826087 0.50000000 0.60869565 0.57826087 0.60869565 0.6956522 0
tree 0.04347826 0.04347826 0.04347826 0.06884058 0.08333333 0.1304348 0
Is adding sampling = "up" the appropriate thing to do here?
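To make concrete what up-sampling does to the class balance, here is a rough base R sketch. It is not caret's implementation, only the general idea: minority rows are drawn with replacement until both classes have the same count. The vector `target_demo` is a stand-in built from the class counts in `summary(x)`, not the real data:

```r
set.seed(123)

# Stand-in for the real target, built from the class counts in the post.
target_demo <- factor(c(rep("X0", 6446), rep("X1", 116)),
                      levels = c("X0", "X1"))

majority_rows <- which(target_demo == "X0")
minority_rows <- which(target_demo == "X1")

# Draw extra minority rows with replacement to close the gap.
n_extra <- length(majority_rows) - length(minority_rows)
extra_rows <- sample(minority_rows, n_extra, replace = TRUE)

upsampled_rows <- c(majority_rows, minority_rows, extra_rows)
balanced_counts <- table(target_demo[upsampled_rows])
print(balanced_counts)
```

As far as I understand, when sampling = "up" is set in trainControl, caret applies this kind of resampling to the training part of each fold only and leaves the held-out rows untouched. That would explain why the ROC hardly moves while Accuracy, Sens and Spec change a lot: the ranking of rows is much the same, but the predicted probabilities shift relative to the 0.5 cutoff.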