列车组中有303个数据点(见图)。其中许多点在Y轴上等于0。
现在我想训练GBM模型来预测Y值。这是我的模特:
train.subset<- data.frame(yval=train$yval,
hour=train$hour,
daymoment=train$daymoment,
year=train$year,
log.windspeed=log(train$windspeed+1),
weather=train$weather,
workingday=train$workingday,
log.temp=log(train$temp+1),
log.atemp=log(train$atemp+1),
log.humidity=log(train$humidity+1))
inTrain <- caret::createDataPartition(train.subset$registered,
p = .85, list = FALSE)
train.registered <- train.subset[inTrain, ]
cv.registered <- train.subset[-inTrain, ]
fitControl <- trainControl(## 5-fold CV
method = "repeatedcv",
number = 10,
## repeated ten times
repeats = 10)
gbmGrid <- expand.grid(interaction.depth = c(1, 5, 9),
n.trees = (5:25)*50,
shrinkage = 0.1)
fit.registered <- train(registered ~., data=train.registered, method = "gbm",trControl = fitControl,verbose = FALSE,tuneGrid = gbmGrid)
prediction.registered<-predict(fit.registered, newdata = cv.registered)
prediction.registered[prediction.registered<0] <- min(prediction.registered[prediction.registered > 0])
RMSE <- sqrt(mean((prediction.registered - cv.registered$registered)^2))
RMSE
然后我得到了相当高的RMSE值:~28。
以下是显示交叉验证集的预测和实际yval
的图表。
我不明白为什么这条相对简单的曲线存在这么大的错误。任何的想法?也许我应该使用caret
找到的调整参数尝试另一个包?
如果此信息有用,以防万一:
> summary(fit.registered)
var rel.inf
hour hour 23.385420
log.atemp log.atemp 12.959972
daymoment.C daymoment.C 11.605700
log.humidity log.humidity 10.972162
log.windspeed log.windspeed 9.627754
daymoment.L daymoment.L 7.517074
daymoment^4 daymoment^4 4.658695
log.temp log.temp 4.567798
workingday workingday 4.135300
daymoment.Q daymoment.Q 3.766462
year year 3.763452
weather weather 3.040211
更新: