Custom caret metric for "precision at k", by group

Time: 2016-08-03 22:37:40

Tags: r r-caret

What is the correct way to create a custom metric function for use in caret::train that takes parameters and can summarize subsets of the training data?

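Imagine we have credit-score and loan data and want to train a model to predict the best loan prospects within different categories of loans (home mortgages, car loans, student loans, etc.). We have limited funds and want to diversify our portfolio, so we would like to identify a small number of low-risk loans in each category.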

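For example, we can use the GermanCredit data in the caret package. In this training data, each loan is classified as "Good" or "Bad". After rearranging some columns, we have a single column, Purpose, that identifies the type of loan requested.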

## Load packages
library(data.table); library(caret); library(xgboost); library(Metrics)

## Load data and convert dependent variable (Class) to factor
data(GermanCredit)
setDT(GermanCredit, keep.rownames=TRUE)
GermanCredit[, `:=`(rn=as.numeric(rn), Class=factor(Class, levels=c("Good", "Bad")))]

## Now we need to collapse a few columns...
##  - Columns containing purpose for getting loan
colsPurpose <- names(GermanCredit)[names(GermanCredit) %like% "Purpose."]

##  - Replace purpose columns with a single factor column
GermanCredit[, Purpose:=melt(GermanCredit, id.var="rn", measure.vars=colsPurpose)[
  value==1][order(rn), factor(sub("Purpose.", "", variable))]]

##  - Drop purpose columns
GermanCredit[, (colsPurpose) := NULL]

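Now we need to create the custom metric function. Something like precision at k (where k is the number of loans we want to make in each category), averaged across groups, seems appropriate, but I am open to suggestions. In any case, the function should look something like this: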

twoClassGroup <- function (data, lev=NULL, model=NULL, k, ...) {
  if(length(levels(data$obs)) > 2)
    stop(paste("Your outcome has", length(levels(data$obs)),
               "levels. The twoClassGroup() function isn't appropriate."))
  if (!all(levels(data$pred) == levels(data$obs)))
    stop("levels of observed and predicted data do not match")

  ## [subset the data, probably using data$rowIndex]

  ## [calculate the metrics, based on data$pred and data$obs]

  ## [return a named vector of metrics]
}
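To make the target metric concrete, here is a minimal sketch of "precision at k, averaged by group" in base R. It runs on made-up toy data, and the names toy, prob, and precisionAtK are hypothetical illustrations rather than anything provided by caret. It assumes a predicted probability of the "Good" class is available for ranking.

```r
## Toy data: 'prob' is a hypothetical predicted probability of "Good"
toy <- data.frame(
  group = c("car", "car", "car", "car", "edu", "edu", "edu", "edu"),
  obs   = c("Good", "Good", "Bad", "Good", "Bad", "Good", "Bad", "Good"),
  prob  = c(0.9, 0.8, 0.7, 0.2, 0.95, 0.6, 0.5, 0.4)
)

## Precision at k within one group:
## the share of "Good" loans among the k highest-scored rows
precisionAtK <- function(obs, prob, k, positive = "Good") {
  top <- order(prob, decreasing = TRUE)[seq_len(min(k, length(prob)))]
  mean(obs[top] == positive)
}

## Compute the metric per group, then average across groups
perGroup <- sapply(split(toy, toy$group),
                   function(d) precisionAtK(d$obs, d$prob, k = 2))
meanPrecisionAtK <- mean(perGroup)
```

With k = 2, the top two "car" loans are both good (precision 1) while only one of the top two "edu" loans is good (precision 0.5), so the averaged metric is 0.75.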

Finally, we can train the model.

## Train a model (just an example; may or may not be appropriate for this problem)
creditModel <- train(
  Class ~ . - Purpose, data=GermanCredit, method="xgbTree", 
  trControl=trainControl(
    method="cv", number=6, returnResamp="none", summaryFunction=twoClassGroup,
    classProbs=TRUE, allowParallel=TRUE, verboseIter=TRUE),
  tuneGrid = expand.grid(
    nrounds=500, max_depth=6, eta=0.02, gamma=0, colsample_bytree=1, min_child_weight=6),
  metric="someCustomMetric", preProc=c("center", "scale"))

## Add predictions
GermanCredit[, `:=`(pred=predict(creditModel, GermanCredit, type="raw"),
                    prob=predict(creditModel, GermanCredit, type="prob")[[levels(creditModel)[1]]])]

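Questions

  • How can I pass the value of k to twoClassGroup from the train call? Adding it to the main function arguments does not work, nor does adding it to trControl or tuneGrid.
  • How can I subset the data within twoClassGroup so as to calculate the model's precision for the top k values within each value of Purpose? The data object inside the twoClassGroup function is not the same as the object passed to the original train function.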

1 answer:

Answer 0 (score: 1)

This attempt mostly works, but I hope someone can share a better approach. Rather than being passed in from train, the dt and k arguments are "hard-coded" as defaults in twoClassGroup. Also, the value of Metrics::mapk seems very low, although the final model does seem to pick the best loan prospects.

library(Metrics)

twoClassGroup <- function (data, lev=NULL, model=NULL, dt=GermanCredit, k=10) {
  if(length(levels(data$obs)) > 2)
    stop(paste("Your outcome has", length(levels(data$obs)),
               "levels. The twoClassGroup() function isn't appropriate."))
  if (!all(levels(data$pred) == levels(data$obs)))
    stop("levels of observed and predicted data do not match")

  ## Attach each held-out row's loan purpose, looked up by row index
  data <- data.table(data, group=dt[data$rowIndex, Purpose])

  ## You can ignore these extra metrics...
  ## <-----
  sens <- sensitivity(data$pred, data$obs, positive=lev[1])
  spec <- specificity(data$pred, data$obs, positive=lev[1])
  precision <- posPredValue(data$pred, data$obs)
  recall <- sens
  Fbeta <- function(precision, recall, beta=1) {
    val <- (1+beta^2)*(precision*recall)/(precision*beta^2 + recall)
    if(is.nan(val)) val <- 0
    return(val)
  }
  F0.5 <- Fbeta(precision, recall, beta=0.5)
  F1 <- Fbeta(precision, recall, beta=1)
  F2 <- Fbeta(precision, recall, beta=2)
  ## ----->

  ## This is the important one...
  mapk <- data[, .(obs=list(obs), pred=list(pred)), by=group][, mapk(k, obs, pred)]

  return(c(sensitivity=sens, specificity=spec, F0.5=F0.5, F1=F1, F2=F2, mapk=mapk))
}

In the train call from the original post, the value of metric is "mapk" rather than "someCustomMetric".
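One possible way to avoid hard-coding k and the grouping data is a function factory: an outer function captures them in a closure and returns a summary function with the usual (data, lev, model) signature. The sketch below is an assumption-laden illustration, not a tested replacement for the code above. makeGroupSummary and precAtK are hypothetical names. It assumes that the data frame handed to the summary function carries rowIndex and a probability column named after the first class level (which requires classProbs=TRUE). It also computes a plain precision at k per group instead of Metrics::mapk.

```r
## Hypothetical factory: returns a caret-style summary function with
## 'groups' and 'k' captured in its enclosing environment.
makeGroupSummary <- function(groups, k = 10) {
  function(data, lev = NULL, model = NULL) {
    ## Look up each held-out row's group via its row index
    grp  <- groups[data$rowIndex]
    ## Probability of the first (positive) level, used for ranking
    prob <- data[[lev[1]]]
    ## Precision among the k highest-scored rows of each group
    precAtK <- sapply(split(seq_len(nrow(data)), grp, drop = TRUE), function(i) {
      top <- i[order(prob[i], decreasing = TRUE)][seq_len(min(k, length(i)))]
      mean(data$obs[top] == lev[1])
    })
    ## Average across groups and return a named vector
    c(precAtK = mean(precAtK))
  }
}

## Toy check with made-up values shaped like a summary-function input
toy <- data.frame(
  obs      = factor(c("Good", "Good", "Bad", "Bad", "Good", "Bad"),
                    levels = c("Good", "Bad")),
  Good     = c(0.9, 0.8, 0.7, 0.95, 0.6, 0.5),
  rowIndex = 1:6
)
toyGroups <- c("car", "car", "car", "edu", "edu", "edu")
summaryK2 <- makeGroupSummary(toyGroups, k = 2)
result    <- summaryK2(toy, lev = c("Good", "Bad"))
```

Under these assumptions, the train call would receive summaryFunction=makeGroupSummary(GermanCredit$Purpose, k=10) inside trainControl, together with metric="precAtK". Whether this behaves as intended with parallel resampling has not been verified here.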