I have a dataset like this:
> head(training_data)
year month channelGrouping visitStartTime visitNumber timeSinceLastVisit browser
1 2016 October Social 1477775021 1 0 Chrome
2 2016 September Social 1473037945 1 0 Safari
3 2017 July Organic Search 1500305542 1 0 Chrome
4 2017 July Organic Search 1500322111 2 16569 Chrome
5 2016 August Social 1471890172 1 0 Safari
6 2017 May Direct 1495146428 1 0 Chrome
operatingSystem isMobile continent subContinent country source medium
1 Windows 0 Americas South America Brazil youtube.com referral
2 Macintosh 0 Americas Northern America United States youtube.com referral
3 Windows 0 Americas Northern America Canada google organic
4 Windows 0 Americas Northern America Canada google organic
5 Macintosh 0 Africa Eastern Africa Zambia youtube.com referral
6 Android 1 Americas Northern America United States (direct)
isTrueDirect hits pageviews positiveTransaction
1 0 1 1 No
2 0 1 1 No
3 0 5 5 No
4 1 3 3 No
5 0 1 1 No
6 1 6 6 No
> str(training_data)
'data.frame': 1000 obs. of 18 variables:
$ year : int 2016 2016 2017 2017 2016 2017 2016 2017 2017 2016 ...
$ month : Factor w/ 12 levels "January","February",..: 10 9 7 7 8 5 10 3 3 12 ...
$ channelGrouping : chr "Social" "Social" "Organic Search" "Organic Search" ...
$ visitStartTime : int 1477775021 1473037945 1500305542 1500322111 1471890172 1495146428 1476003570 1488556031 1490323225 1480696262 ...
$ visitNumber : int 1 1 1 2 1 1 1 1 1 1 ...
$ timeSinceLastVisit : int 0 0 0 16569 0 0 0 0 0 0 ...
$ browser : chr "Chrome" "Safari" "Chrome" "Chrome" ...
$ operatingSystem : chr "Windows" "Macintosh" "Windows" "Windows" ...
$ isMobile : int 0 0 0 0 0 1 0 1 0 0 ...
$ continent : Factor w/ 5 levels "Africa","Americas",..: 2 2 2 2 1 2 3 3 2 4 ...
$ subContinent : chr "South America" "Northern America" "Northern America" "Northern America" ...
$ country : chr "Brazil" "United States" "Canada" "Canada" ...
$ source : chr "youtube.com" "youtube.com" "google" "google" ...
$ medium : chr "referral" "referral" "organic" "organic" ...
$ isTrueDirect : int 0 0 0 1 0 1 0 0 0 0 ...
$ hits : int 1 1 5 3 1 6 1 1 2 1 ...
$ pageviews : int 1 1 5 3 1 6 1 1 2 1 ...
$ positiveTransaction: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 …
Then I define my custom RMSLE function using the Metrics package:
rmsleMetric <- function(data, lev = NULL, model = NULL){
  out <- Metrics::rmsle(data$obs, data$pred)
  names(out) <- c("rmsle")
  return (out)
}
Then I define the trainControl:
tc <- trainControl(method = "repeatedcv",
                   number = 5,
                   repeats = 5,
                   summaryFunction = rmsleMetric,
                   classProbs = TRUE)
My tuning grid for the grid search:
tg <- expand.grid(alpha = 0, lambda = seq(0, 1, by = 0.1))
Finally, my model:
penalizedLogit_ridge <- train(positiveTransaction ~ .,
                              data = training_data,
                              metric = "rmsle",
                              method = "glmnet",
                              family = "binomial",
                              trControl = tc,
                              tuneGrid = tg
)
When I try to run the command above, I get this error:
Something is wrong; all the rmsle metric values are missing:
rmsle
Min. : NA
1st Qu.: NA
Median : NA
Mean :NaN
3rd Qu.: NA
Max. : NA
NA's :11
Error: Stopping
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Looking at the warnings, I find:
1: In Ops.factor(1, actual) : ‘+’ not meaningful for factors
2: In Ops.factor(1, predicted) : ‘+’ not meaningful for factors
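repeated 25 times.
The same setup works fine if I change the metric to AUC and use prSummary as the summary function, so I don't think there is anything wrong with the data.
So I believe my function is wrong, but I don't know how to figure out what is wrong with it.
Any help is greatly appreciated.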
Answer 0 (score: 2)
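Your custom metric is not defined properly. If you use classProbs = TRUE and savePredictions = "final" with trainControl, you will see that there are two columns named after your target classes that hold the predicted probabilities, while the data$pred column holds the predicted class, which cannot be used to calculate the desired metric.
A proper way to define the function is to get the possible levels and use them to extract the probabilities for one of the classes:
rmsleMetric <- function(data, lev = NULL, model = NULL){
  # class levels; the probability columns in `data` are named after them
  lvls <- levels(data$obs)
  # code the first class as 1 and compare it with its predicted probability
  out <- Metrics::rmsle(ifelse(data$obs == lvls[1], 1, 0),
                        data[, lvls[1]])
  names(out) <- c("rmsle")
  return (out)
}
Does it work?
library(caret)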
library(mlbench)
data(Sonar)
tc <- trainControl(method = "repeatedcv",
                   number = 2,
                   repeats = 2,
                   summaryFunction = rmsleMetric,
                   classProbs = TRUE,
                   savePredictions = "final")
tg <- expand.grid(alpha = 0, lambda = seq(0, 1, by = 0.1))
penalizedLogit_ridge <- train(Class ~ .,
                              data = Sonar,
                              metric = "rmsle",
                              method = "glmnet",
                              family = "binomial",
                              trControl = tc,
                              tuneGrid = tg)
#output
glmnet
208 samples
60 predictor
2 classes: 'M', 'R'
No pre-processing
Resampling: Cross-Validated (2 fold, repeated 2 times)
Summary of sample sizes: 105, 103, 104, 104
Resampling results across tuning parameters:
lambda rmsle
0.0 0.2835407
0.1 0.2753197
0.2 0.2768288
0.3 0.2797847
0.4 0.2827953
0.5 0.2856088
0.6 0.2881894
0.7 0.2905501
0.8 0.2927171
0.9 0.2947169
1.0 0.2965505
Tuning parameter 'alpha' was held constant at a value of 0
rmsle was used to select the optimal model using the largest value.
The final values used for the model were alpha = 0 and lambda = 1.
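One thing worth noting in the output above: the model was selected "using the largest value", but RMSLE is an error measure where lower is better. The sketch below uses base R only and copies the resampled values from the output to show which lambda each rule would pick. The names res, picked_by_max, and picked_by_min are illustrative and not part of caret. In caret itself, the usual way to ask for the smallest value is the maximize = FALSE argument of train().

```r
# Resampled RMSLE values copied from the printed output above
res <- data.frame(
  lambda = seq(0, 1, by = 0.1),
  rmsle  = c(0.2835407, 0.2753197, 0.2768288, 0.2797847, 0.2827953,
             0.2856088, 0.2881894, 0.2905501, 0.2927171, 0.2947169,
             0.2965505)
)

# Rule used in the run above: largest metric value wins
picked_by_max <- res$lambda[which.max(res$rmsle)]

# Rule that suits an error metric: smallest metric value wins
# (in caret this corresponds to train(..., maximize = FALSE))
picked_by_min <- res$lambda[which.min(res$rmsle)]

print(c(max_rule = picked_by_max, min_rule = picked_by_min))
```

With these numbers the two rules disagree, so adding maximize = FALSE to the train() call is likely what you want when tuning on RMSLE.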
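You can check how caret's built-in summary functions are defined (twoClassSummary, for example); they are defined in a very similar way.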
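To see why the original function returned only NA, here is a minimal sketch in base R. It assumes a made-up resample shaped like the data frame a summary function receives: obs and pred are factors, plus one probability column per class. The helper rmsle_manual is hypothetical; it follows the RMSLE formula sqrt(mean((log(1 + actual) - log(1 + predicted))^2)) so that the Metrics package is not needed to run the example.

```r
# Hypothetical helper that follows the RMSLE formula (not the Metrics package itself)
rmsle_manual <- function(actual, predicted) {
  sqrt(mean((log1p(actual) - log1p(predicted))^2))
}

# Made-up data: observed class, predicted class, and class probabilities
data <- data.frame(
  obs  = factor(c("No", "No", "Yes", "No"), levels = c("No", "Yes")),
  pred = factor(c("No", "No", "No", "No"), levels = c("No", "Yes")),
  No   = c(0.9, 0.8, 0.3, 0.7),
  Yes  = c(0.1, 0.2, 0.7, 0.3)
)

# Arithmetic on a factor is undefined: R warns that '+' is not
# meaningful for factors and returns NA for every element
bad <- suppressWarnings(1 + data$obs)

# A numeric 0/1 target compared with the matching class probability works
lvls <- levels(data$obs)
good <- rmsle_manual(ifelse(data$obs == lvls[1], 1, 0), data[, lvls[1]])

print(bad)
print(good)
```

The first result is all NA, which matches the warnings in the question. The second is a finite number because both inputs are numeric.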