I am an actuarial student preparing for the predictive analytics exam coming up in December. Part of the practice is to build a boosting model with caret and xgbTree. See the code below; the Caravan dataset comes from the ISLR package:
The definitions in expand.grid and trainControl were specified by the problem, but I keep receiving error messages:
library(caret)
library(ggplot2)
set.seed(1000)
data.Caravan <- read.csv(file = "Caravan.csv")
data.Caravan$Purchase <- factor(data.Caravan$Purchase)
levels(data.Caravan$Purchase) <- c("No", "Yes")
data.Caravan.train <- data.Caravan[1:1000, ]
data.Caravan.test <- data.Caravan[1001:nrow(data.Caravan), ]
grid <- expand.grid(max_depth = c(1:7),
                    nrounds = 500,
                    eta = c(.01, .05, .01),
                    colsample_bytree = c(.5, .8),
                    gamma = 0,
                    min_child_weight = 1,
                    subsample = .6)
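# Note: eta lists .01 twice, so the grid really contains only two distinct
# learning rates (.01 and .05); a value such as .1 may have been intended.
# Also, trainControl expects a single character value for sampling, so with
# c("up", "down") only up-sampling is actually applied here (see
# "Addtional sampling using up-sampling" in the fitted-model output below).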
control <- trainControl(method = "cv",
                        number = 4,
                        classProbs = TRUE,
                        sampling = c("up", "down"))
caravan.boost <- train(formula = Purchase ~ .,
                       data = data.Caravan.train,
                       method = "xgbTree",
                       metric = "Accuracy",
                       trControl = control,
                       tuneGrid = grid)
If the sampling methods are removed from trainControl, I get a new error stating "Metric Accuracy not applicable for regression models". If the "Accuracy" metric is removed instead, the error message displayed is:
Error: sampling methods are only implemented for classification problems
The ultimate problem is that caret is defining the problem as regression rather than classification, even though the target variable is set as a factor and classProbs is set to TRUE. Can someone explain how to tell caret to run classification instead of regression?
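For reference, a quick sanity check (a minimal sketch of a typical session) confirms that the target really is a two-level factor:

str(data.Caravan.train$Purchase)
# prints: Factor w/ 2 levels "No","Yes": ...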
Answer 0 (score: 0):
caret::train does not have a formula argument; it has a form argument in which you can specify the formula. So, for example, this works:
caravan.boost <- train(form = Purchase ~ .,
                       data = data.Caravan.train,
                       method = "xgbTree",
                       metric = "Accuracy",
                       trControl = control,
                       tuneGrid = grid)
#output:
eXtreme Gradient Boosting
1000 samples
85 predictor
2 classes: 'No', 'Yes'
No pre-processing
Resampling: Cross-Validated (4 fold)
Summary of sample sizes: 751, 749, 750, 750
Addtional sampling using up-sampling
Resampling results across tuning parameters:
eta max_depth colsample_bytree Accuracy Kappa
0.01 1 0.5 0.7020495 0.10170007
0.01 1 0.8 0.7100335 0.09732773
0.01 2 0.5 0.7730581 0.12361444
0.01 2 0.8 0.7690620 0.11293561
0.01 3 0.5 0.8330506 0.14461709
0.01 3 0.8 0.8290146 0.06908344
0.01 4 0.5 0.8659949 0.07396586
0.01 4 0.8 0.8749790 0.07451637
0.01 5 0.5 0.8949792 0.07599005
0.01 5 0.8 0.8949792 0.07525191
0.01 6 0.5 0.9079873 0.09766492
0.01 6 0.8 0.9099793 0.10420720
0.01 7 0.5 0.9169833 0.11769151
0.01 7 0.8 0.9119753 0.10873268
0.05 1 0.5 0.7640699 0.08281792
0.05 1 0.8 0.7700580 0.09201503
0.05 2 0.5 0.8709909 0.09034807
0.05 2 0.8 0.8739990 0.10440898
0.05 3 0.5 0.9039792 0.12166348
0.05 3 0.8 0.9089832 0.11850402
0.05 4 0.5 0.9149793 0.11602447
0.05 4 0.8 0.9119713 0.11207786
0.05 5 0.5 0.9139633 0.11853793
0.05 5 0.8 0.9159754 0.11968085
0.05 6 0.5 0.9219794 0.11744643
0.05 6 0.8 0.9199794 0.12803204
0.05 7 0.5 0.9179873 0.08701058
0.05 7 0.8 0.9179793 0.10702619
Tuning parameter 'nrounds' was held constant at a value of 500
Tuning parameter 'gamma' was held constant at a value of 0
Tuning parameter 'min_child_weight' was held constant at a value of 1
Tuning parameter 'subsample' was held constant at a value of 0.6
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were nrounds = 500, max_depth = 6, eta = 0.05, gamma = 0,
colsample_bytree = 0.5, min_child_weight = 1 and subsample = 0.6.
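The "2 classes: 'No', 'Yes'" line in the header above shows that caret is now fitting a classifier. As a follow-up sketch (assuming the caravan.boost object trained above), you can obtain class probabilities for the held-out rows with predict:

# Returns two probability columns ("No" and "Yes") because
# classProbs = TRUE was set in trainControl.
head(predict(caravan.boost, newdata = data.Caravan.test, type = "prob"))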
You can also use the non-formula interface, in which you specify x and y separately:
caravan.boost <- train(x = data.Caravan.train[, -ncol(data.Caravan.train)],
                       y = data.Caravan.train$Purchase,
                       method = "xgbTree",
                       metric = "Accuracy",
                       trControl = control,
                       tuneGrid = grid)
Note that these two specification methods do not always produce the same results when x contains factor variables, because the formula interface of most algorithms calls model.matrix, which expands factors into dummy columns, whereas the x/y interface passes them through unchanged.
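Here is a minimal sketch with toy data (not the Caravan set) of the expansion model.matrix performs on a factor:

# The formula interface expands the factor f into a dummy column fb,
# whereas the x/y interface would pass f through as a single factor column.
toy <- data.frame(y = c(1, 2, 3), f = factor(c("a", "b", "a")))
model.matrix(y ~ ., data = toy)
#   (Intercept) fb
# 1           1  0
# 2           1  1
# 3           1  0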
To get the data:
library(ISLR)
data(Caravan)
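If you also want to reproduce the read.csv("Caravan.csv") call from the question, one option (assuming write access to the working directory) is to export the dataset first:

# Write the ISLR Caravan data to the CSV file that the question reads back in.
write.csv(Caravan, file = "Caravan.csv", row.names = FALSE)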