XGBoost input data problem

Asked: 2017-02-23 15:59:06

Tags: r xgboost

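I have a time series with monthly granularity and 7 months of data, and I am trying to predict profitability in month 7 by training on the first six months. I split the data 80/20. XGBoost gives me an extremely low RMSE that I cannot get from any other algorithm, which makes me a bit suspicious. So I decided to check which features are most important, and that returned numbers instead of a list of feature names. This makes me suspect that I am not feeding the data into the algorithm correctly. I apologize for the noob question, but I guess I am one. Any help is much appreciated.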

require(caTools)
require(Matrix)
require(data.table)
require(xgboost)
set.seed(111) 
sample = sample.split(new_flat$SUBSCRIPTION_ID, SplitRatio = .80)
train = subset(new_flat, sample == TRUE)
train <- subset( train, select = -SUBSCRIPTION_ID ) #Removing Subscription_id
test = subset(new_flat, sample == FALSE)
test <- subset( test, select = -SUBSCRIPTION_ID ) #Removing Subscription_id
target=test$Total_MARGIN_7 #Value I want to predict in the test set
dtrain <- xgb.DMatrix(data = as.matrix(train), label = train[,7])# I think this is the problem here
dtest <- xgb.DMatrix(data = as.matrix(test), label = test[,7]) # I think this is the problem here

bst <- xgboost(data = dtrain, max_depth = 5, eta = 1, nrounds = 20, 
               objective = "reg:linear")
pred <- predict(bst, dtest)
mean(pred)
RMSE <- sqrt(mean((as.numeric(target) - pred)^2)) # Yes as.numeric is redundant here
RMSE

1 answer:

Answer 0 (score: 0)

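Suspiciously "good" performance is very often the result of cheating in the input data. Here, the dependent variable is still among the features, so it must be removed from the matrix passed as data: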

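dtrain <- xgb.DMatrix(data = as.matrix(train[, -7]), label = train[, 7]) # drop column 7 (the label) from the features
dtest <- xgb.DMatrix(data = as.matrix(test[, -7]), label = test[, 7])    # assumes column 7 is the target, as in the question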
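
To see why leaving the dependent variable among the features gives an implausibly low RMSE, here is a minimal sketch in base R. Everything in it is an assumption made for illustration: the data is synthetic, the column names only mimic the question, and lm stands in for XGBoost so that the example runs without extra packages. It is not the asker's data or model.

```r
# Hypothetical synthetic data: six monthly margins plus a noisy month-7 target.
set.seed(111)
n <- 500
df <- as.data.frame(matrix(rnorm(n * 6), ncol = 6))
names(df) <- paste0("Total_MARGIN_", 1:6)
df$Total_MARGIN_7 <- 0.5 * df$Total_MARGIN_6 + rnorm(n)

# 80/20 split, as in the question.
idx <- sample(n, size = 400)
train <- df[idx, ]
test <- df[-idx, ]

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Leaky setup: a copy of the target sits among the predictors,
# which is what as.matrix(train) does in the question.
leaky_train <- cbind(train, leak = train$Total_MARGIN_7)
leaky_test <- cbind(test, leak = test$Total_MARGIN_7)
fit_leaky <- lm(Total_MARGIN_7 ~ ., data = leaky_train)
rmse_leaky <- rmse(test$Total_MARGIN_7, predict(fit_leaky, leaky_test))

# Clean setup: only months 1 to 6 are used as predictors.
fit_clean <- lm(Total_MARGIN_7 ~ ., data = train)
rmse_clean <- rmse(test$Total_MARGIN_7, predict(fit_clean, test))

print(c(leaky = rmse_leaky, clean = rmse_clean))
```

The leaky model simply reads the answer off its input, so its error collapses to roughly zero, while the clean model's error stays near the noise level of the target. The same effect should be expected with any learner, XGBoost included, when the label column is left in the feature matrix.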