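What is the proper way to create a custom metric function for use in caret::train that accepts arguments and can summarize subsets of the training data?
Imagine we have credit scoring and loan data, and we'd like to train a model to predict the top loan prospects within different categories of loans (home mortgages, auto loans, student loans, etc.). We have limited funds and want to diversify our portfolio, so we'd like to identify a few low-risk loans in each category.
As an example, we can use the GermanCredit data from the caret package. In this training data, each loan is classified as either "Good" or "Bad". After rearranging a few columns, we have a single column, Purpose, which identifies the type of loan requested.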
## Load packages
library(data.table); library(caret); library(xgboost); library(Metrics)
## Load data and convert dependent variable (Class) to factor
data(GermanCredit)
setDT(GermanCredit, keep.rownames=TRUE)
GermanCredit[, `:=`(rn=as.numeric(rn), Class=factor(Class, levels=c("Good", "Bad")))]
## Now we need to collapse a few columns...
## - Columns containing purpose for getting loan
colsPurpose <- names(GermanCredit)[names(GermanCredit) %like% "Purpose."]
## - Replace purpose columns with a single factor column
GermanCredit[, Purpose:=melt(GermanCredit, id.vars="rn", measure.vars=colsPurpose)[
  value==1][order(rn), factor(sub("Purpose.", "", variable))]]
## - Drop purpose columns
GermanCredit[, (colsPurpose):=NULL]
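Now we need to create the custom metric function. Something like precision at k (where k is the number of loans we'd like to offer in each category), averaged across groups, seems appropriate, but I'm open to suggestions. In any case, the function should look something like this: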
twoClassGroup <- function (data, lev=NULL, model=NULL, k, ...) {
  if(length(levels(data$obs)) > 2)
    stop(paste("Your outcome has", length(levels(data$obs)),
               "levels. The twoClassGroup() function isn't appropriate."))
  if (!all(levels(data$pred) == levels(data$obs)))
    stop("levels of observed and predicted data do not match")
  ## [subset the data, probably using data$rowIndex]
  ## [calculate the metrics, based on data$pred and data$obs]
  ## [return a named vector of metrics]
}
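To make the "precision at k, averaged across groups" idea concrete, here is a minimal base-R sketch. The helper name precisionAtKByGroup and its arguments are hypothetical (they are not part of caret), and it assumes a predicted probability for the positive class is available for ranking.

```r
## Hypothetical helper (not part of caret): mean precision at k across groups.
## obs      - factor of observed classes
## prob     - predicted probability of the positive class
## group    - grouping vector (e.g. loan purpose)
## k        - number of top-ranked cases to keep per group
## positive - label treated as the positive class
precisionAtKByGroup <- function(obs, prob, group, k, positive = levels(obs)[1]) {
  perGroup <- vapply(split(seq_along(obs), group, drop = TRUE), function(idx) {
    ## Rank the cases in this group by probability and keep the top k
    top <- idx[order(prob[idx], decreasing = TRUE)][seq_len(min(k, length(idx)))]
    ## Share of the top-k cases that are truly positive
    mean(obs[top] == positive)
  }, numeric(1))
  mean(perGroup)
}

## Toy data, for illustration only
obs   <- factor(c("Good", "Bad", "Good", "Good", "Bad", "Bad"), levels = c("Good", "Bad"))
prob  <- c(0.9, 0.8, 0.3, 0.7, 0.6, 0.2)
group <- c("car", "car", "car", "edu", "edu", "edu")
precAt2 <- precisionAtKByGroup(obs, prob, group, k = 2)
precAt1 <- precisionAtKByGroup(obs, prob, group, k = 1)
```

With this toy data, the top two cases in each group contain one "Good" and one "Bad" loan, while the single top case in each group is "Good".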
Finally, we can train the model.
## Train a model (just an example; may or may not be appropriate for this problem)
creditModel <- train(
  Class ~ . - Purpose, data=GermanCredit, method="xgbTree",
  trControl=trainControl(
    method="cv", number=6, returnResamp="none", summaryFunction=twoClassGroup,
    classProbs=TRUE, allowParallel=TRUE, verboseIter=TRUE),
  tuneGrid = expand.grid(
    nrounds=500, max_depth=6, eta=0.02, gamma=0, colsample_bytree=1, min_child_weight=6),
  metric="someCustomMetric", preProc=c("center", "scale"))
## Add predictions
GermanCredit[, `:=`(pred=predict(creditModel, GermanCredit, type="raw"),
prob=predict(creditModel, GermanCredit, type="prob")[[levels(creditModel)[1]]])]
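Questions
1. How do I pass the value of k to twoClassGroup in the call to train? Adding it to the main function arguments doesn't work, nor does adding it to trControl or tuneGrid.
2. How do I subset the data within twoClassGroup in order to calculate the model precision in the top k values within each value of Purpose? The data object inside twoClassGroup is not the same as the one passed to the original train function.

Answer 0 (score: 1)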
This attempt mostly works, but I'm hoping someone can share a better approach. Instead of passing the dt and k arguments from train, they are "hard-coded" as default arguments of twoClassGroup. Also, the value of Metrics::mapk seems to be very low, although the final model does seem to select the best loan prospects.
library(Metrics)
twoClassGroup <- function (data, lev=NULL, model=NULL, dt=GermanCredit, k=10) {
  if(length(levels(data$obs)) > 2)
    stop(paste("Your outcome has", length(levels(data$obs)),
               "levels. The twoClassGroup() function isn't appropriate."))
  if (!all(levels(data$pred) == levels(data$obs)))
    stop("levels of observed and predicted data do not match")
  data <- data.table(data, group=dt[data$rowIndex, Purpose])
  ## You can ignore these extra metrics...
  ## <-----
  sens <- sensitivity(data$pred, data$obs, positive=lev[1])
  spec <- specificity(data$pred, data$obs, positive=lev[1])
  precision <- posPredValue(data$pred, data$obs)
  recall <- sens
  Fbeta <- function(precision, recall, beta=1) {
    val <- (1+beta^2)*(precision*recall)/(precision*beta^2 + recall)
    if(is.nan(val)) val <- 0
    return(val)
  }
  F0.5 <- Fbeta(precision, recall, beta=0.5)
  F1 <- Fbeta(precision, recall, beta=1)
  F2 <- Fbeta(precision, recall, beta=2)
  ## ----->
  ## This is the important one...
  mapk <- data[, .(obs=list(obs), pred=list(pred)), by=group][, mapk(k, obs, pred)]
  return(c(sensitivity=sens, specificity=spec, F0.5=F0.5, F1=F1, F2=F2, mapk=mapk))
}
In the train call from the original post, the value of metric would be "mapk" instead of "someCustomMetric".
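One possible alternative to hard-coding dt and k as defaults is to build the summary function with a closure, so the values are fixed at the point where the function is created. The sketch below is only an illustration under assumptions: makeGroupSummary and the metric name precAtK are hypothetical names, and the toy data frame merely imitates the shape of the held-out data described above (obs, pred, one probability column per class level, and rowIndex).

```r
## Hypothetical factory (not part of caret): returns a summary function
## with the grouping vector and k captured in its enclosing environment.
makeGroupSummary <- function(groups, k) {
  force(groups); force(k)
  function(data, lev = NULL, model = NULL) {
    ## Map held-out rows back to their group via rowIndex
    grp  <- groups[data$rowIndex]
    ## Probability column for the first (positive) class level
    prob <- data[[lev[1]]]
    perGroup <- vapply(split(seq_len(nrow(data)), grp, drop = TRUE), function(idx) {
      top <- idx[order(prob[idx], decreasing = TRUE)][seq_len(min(k, length(idx)))]
      mean(data$obs[top] == lev[1])
    }, numeric(1))
    ## Named vector, as a summary function is expected to return
    c(precAtK = mean(perGroup))
  }
}

## Toy stand-in for the held-out data of one resample (illustration only)
heldOut <- data.frame(
  obs      = factor(c("Good", "Bad", "Good", "Bad"), levels = c("Good", "Bad")),
  pred     = factor(c("Good", "Good", "Bad", "Bad"), levels = c("Good", "Bad")),
  Good     = c(0.9, 0.6, 0.4, 0.1),
  Bad      = c(0.1, 0.4, 0.6, 0.9),
  rowIndex = c(2, 3, 5, 6))
purpose   <- c("car", "car", "car", "edu", "edu", "edu")
summaryFn <- makeGroupSummary(purpose, k = 1)
result    <- summaryFn(heldOut, lev = c("Good", "Bad"))
```

Under these assumptions, the train call would use summaryFunction=makeGroupSummary(GermanCredit$Purpose, k=10) inside trainControl (with classProbs=TRUE) and metric="precAtK". Whether this ranks models better than the mapk-based version has not been verified here.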