Question

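This is my first attempt at xgboost, and my model performs poorly even with GridSearchCV.

My overarching goal: implement an xgboost model that can handle NaN values by replacing rows of the test dataset with NaN values, then compare the performance on the test dataset without NaN values against the modified test dataset. I want to find out how many NaN values the model can handle without imputation.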

My dataset header:

The first three columns are all "one" and are not to be replaced with NaN values. They are fixed values.
CLV is the target value.

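location.., email.. and sms.. are 1 for yes and 0 for no.
age.. and conversion.. take on multiple values.
These five columns are to be replaced with NaN values one after another.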
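
To make the planned experiment concrete, here is a minimal sketch of the masking step, assuming pandas and NumPy. The helper `mask_with_nan`, the demo frame, and its column names are hypothetical and not taken from my real dataset; it only illustrates setting a chosen fraction of rows to NaN in selected columns.

```python
import numpy as np
import pandas as pd


def mask_with_nan(X, columns, fraction, seed=0):
    """Return a copy of X where `fraction` of the rows have NaN in `columns`."""
    rng = np.random.default_rng(seed)
    # Cast the affected columns to float first: integer columns cannot hold NaN.
    # astype returns a new frame, so the input X is left untouched.
    X_masked = X.astype({c: float for c in columns})
    n_rows = int(round(fraction * len(X_masked)))
    # Pick row positions without replacement so exactly n_rows rows are hit.
    row_positions = rng.choice(len(X_masked), size=n_rows, replace=False)
    col_positions = [X_masked.columns.get_loc(c) for c in columns]
    X_masked.iloc[row_positions, col_positions] = np.nan
    return X_masked


# Tiny stand-in for X_test; the column names are made up for illustration.
X_demo = pd.DataFrame({
    "fixed_a": [1] * 10,
    "email_flag": [0, 1] * 5,
    "age_group": list(range(10)),
})

X_half = mask_with_nan(X_demo, ["email_flag", "age_group"], fraction=0.5)
print(X_half.isna().sum())
```

The masked frame could then be passed to the fitted model's `predict` and scored the same way as the unmodified test set. As far as I understand, XGBoost treats `np.nan` as missing by default, so no imputation step should be needed for prediction.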

Now I split the dataset, define the parameter grid, set up the grid search, and finally predict with best_estimator_:

import pandas as pd
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBRegressor
from sklearn.metrics import r2_score, mean_squared_error

# Load the dataset; the first CSV column is used as the index.
data_frame = pd.read_csv('final_id_dataset.csv', index_col=0)
data_frame.shape
>> (4349, 9)

# The last column (CLV) is the target; all other columns are features.
X, y = data_frame.iloc[:, :-1], data_frame.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Hyperparameter grid for the exhaustive search.
params = {'min_child_weight': [4, 5, 6, 7],
          'gamma': [i/10.0 for i in range(2, 8)],
          'subsample': [i/10.0 for i in range(5, 11)],
          'colsample_bytree': [i/10.0 for i in range(5, 11)],
          'max_depth': [2, 3, 4, 5, 6, 7],
          'n_estimators': [5, 10, 15, 20]}

# 5-fold cross-validated grid search over the XGBoost regressor.
xgb = XGBRegressor(objective="reg:squarederror", nthread=2)
grid = GridSearchCV(xgb, params, cv=5, verbose=3, n_jobs=2)
grid.fit(X_train, y_train)

# Evaluate the best estimator on the held-out test set.
preds = grid.best_estimator_.predict(X_test)
print(r2_score(y_test, preds, multioutput='variance_weighted'))
>> 0.3135124512955303

print(np.sqrt(mean_squared_error(y_test, preds)))
>> 138.52291264914857
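
For scale, the size of this search can be counted with a short sketch using only the standard library. It repeats the grid from the code above and multiplies the number of candidates by the number of folds; the variable names `cv_folds`, `n_candidates`, and `n_fits` are my own.

```python
import math

# Same grid as in the question.
params = {
    "min_child_weight": [4, 5, 6, 7],
    "gamma": [i / 10.0 for i in range(2, 8)],
    "subsample": [i / 10.0 for i in range(5, 11)],
    "colsample_bytree": [i / 10.0 for i in range(5, 11)],
    "max_depth": [2, 3, 4, 5, 6, 7],
    "n_estimators": [5, 10, 15, 20],
}

cv_folds = 5  # matches cv=5 passed to GridSearchCV

# Every combination of values is one candidate; each is fitted once per fold.
n_candidates = math.prod(len(values) for values in params.values())
n_fits = n_candidates * cv_folds

print(n_candidates)  # 20736
print(n_fits)        # 103680
```

So the search evaluates 20736 parameter combinations with 103680 cross-validation fits, plus the final refit of the best candidate that GridSearchCV performs by default.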

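So, to start the actual experiment I need a model that performs well, but I cannot figure out what to change to get a well-scoring xgboost model. Is the dataset simply too small, or am I doing something wrong?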
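
To put the two reported scores in relation, here is a small worked calculation, assuming the usual single-output definition R^2 = 1 - MSE / Var(y_test). Under that assumption, the RMSE of a constant model that always predicts the test-set mean can be recovered from the numbers above without touching the data. The variable names are my own.

```python
import math

# Values reported above for the held-out test set.
r2 = 0.3135124512955303
rmse_model = 138.52291264914857

# R^2 = 1 - MSE_model / Var(y_test), so the RMSE of always predicting
# the test-set mean (the standard deviation of y_test) follows as:
rmse_mean_baseline = rmse_model / math.sqrt(1.0 - r2)

# Relative RMSE reduction of the model over that constant baseline.
relative_gain = 1.0 - rmse_model / rmse_mean_baseline

print(round(rmse_mean_baseline, 1))
print(round(relative_gain, 3))
```

In other words, the tuned model lowers the RMSE by roughly 17% compared with predicting the mean CLV for every row.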

XGBoost model -> poor performance

0 answers: