I trained an xgboost model with just 1 tree and dumped it:
booster[0]:
0:[worst_area<884.549988] yes=1,no=2,missing=1,gain=278.707367,cover=106.5
    1:[worst_concave_points<0.135800004] yes=3,no=4,missing=3,gain=32.1795197,cover=71.5
        3:[mean_area<696.25] yes=7,no=8,missing=7,gain=3.56977844,cover=62.75
            7:leaf=0.0952191278,cover=61.75
            8:leaf=-0,cover=1
        4:[mean_texture<19.7099991] yes=9,no=10,missing=9,gain=13.5565615,cover=8.75
            9:leaf=0.0384615399,cover=5.5
            10:leaf=-0.0764705911,cover=3.25
    2:[mean_concavity<0.0721400008] yes=5,no=6,missing=5,gain=9.40318298,cover=35
        5:[mean_texture<19.5449982] yes=11,no=12,missing=11,gain=5.81390381,cover=3.25
            11:leaf=0.0454545468,cover=1.75
            12:leaf=-0.0600000024,cover=1.5
        6:leaf=-0.0969465673,cover=31.75
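I assumed that leaf = some value means that this value is the predicted probability for that leaf.
In the tree above, the prediction would then have to be one of the following values (all < 0.1):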
7:leaf=0.0952191278,cover=61.75
8:leaf=-0,cover=1
9:leaf=0.0384615399,cover=5.5
10:leaf=-0.0764705911,cover=3.25
11:leaf=0.0454545468,cover=1.75
12:leaf=-0.0600000024,cover=1.5
6:leaf=-0.0969465673,cover=31.75
But the predictions on the train/test data show different values. Why is that?
Code:
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn import preprocessing
from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve, auc, roc_auc_score, f1_score
from sklearn.model_selection import KFold, StratifiedKFold, train_test_split, cross_val_score, RandomizedSearchCV, GridSearchCV, ParameterGrid
from sklearn import datasets
breast_cancer = datasets.load_breast_cancer()
X = pd.DataFrame(breast_cancer.data, columns = pd.Series(breast_cancer.feature_names).str.replace(' ', '_'))
y = pd.Series(breast_cancer.target)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
params = {'learning_rate': [0.05], 'max_depth': [3], 'n_estimators': [1]}
param_grid = list(ParameterGrid(params))
xgb_model = xgb.XGBClassifier(**param_grid[0])
xgb_model = xgb_model.fit(X_train, y_train)
y_test_hat = xgb_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_test_hat)
prob = xgb_model.predict_proba(X_train)
prob = pd.Series(prob[:, 1], name = 'prob')
print(prob[0:10])
prob = xgb_model.predict_proba(X_test)
prob = pd.Series(prob[:, 1], name = 'prob')
print(prob[0:10])
xgb_model.get_booster().dump_model('xgb_model.txt', with_stats=True) # see model in the file
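One way to check the assumption that a leaf value is a probability: for the binary:logistic objective, which XGBClassifier uses for two-class targets, leaf values are contributions to the raw margin (log-odds), and the probability comes from applying a sigmoid to the starting margin plus the leaf value. The sketch below uses only the leaf values from the dump above. It assumes base_score = 0.5 (a starting margin of 0), which was the long-standing default; newer xgboost releases may estimate base_score from the training labels, in which case the starting margin differs. The helper names (leaf_to_probability, BASE_SCORE) are made up for this illustration and are not part of xgboost.

```python
import math

# Leaf values copied from the dump above (node id -> leaf value).
leaves = {
    7: 0.0952191278,
    8: -0.0,
    9: 0.0384615399,
    10: -0.0764705911,
    11: 0.0454545468,
    12: -0.0600000024,
    6: -0.0969465673,
}

# Assumption: base_score = 0.5, so the starting margin is logit(0.5) = 0.
# If the fitted model uses a different base_score, change this value.
BASE_SCORE = 0.5


def logit(p):
    """Convert a probability to log-odds."""
    return math.log(p / (1.0 - p))


def sigmoid(z):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def leaf_to_probability(leaf_value, base_score=BASE_SCORE):
    """Map a leaf value of a single-tree binary:logistic model to a probability."""
    return sigmoid(logit(base_score) + leaf_value)


leaf_probabilities = {node: leaf_to_probability(v) for node, v in leaves.items()}

for node, p in sorted(leaf_probabilities.items()):
    print(f"node {node}: leaf={leaves[node]:+.10f} -> probability={p:.6f}")
```

Under that assumption every leaf maps to a probability close to 0.5 (for example, leaf 0.0952191278 maps to roughly 0.524), not to a value below 0.1. To compare against the fitted model directly, xgb_model.predict(X_test, output_margin=True) should return the raw margins before the sigmoid is applied.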