Is normalization (or scaling) useful for regression with gradient tree boosting?

Date: 2018-08-31 11:03:44

Tags: python machine-learning regression xgboost gamma-distribution

I have read that normalization is not needed when using gradient tree boosting (see, for example, Should I need to normalize (or scale) the data for Random forest (drf) or Gradient Boosting Machine (GBM) in H2O or in general? and https://github.com/dmlc/xgboost/issues/357).

I think I understand why, in principle, there is no need to normalize when boosting regression trees.
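
(To illustrate what I mean: here is a minimal sketch of my own, assuming xgboost's default exact split finding, showing that rescaling the *features* leaves the predictions essentially unchanged, up to floating-point effects:)

import numpy as np
import xgboost as xgb
from sklearn.datasets import load_boston

boston = load_boston()
X, y = boston['data'], boston['target']

# Tree splits only compare feature values to thresholds, so a monotone
# rescaling of X should reproduce (essentially) the same predictions.
pred_raw = xgb.XGBRegressor().fit(X, y).predict(X)
pred_scaled = xgb.XGBRegressor().fit(X * 1000.0, y).predict(X * 1000.0)
print(np.max(np.abs(pred_raw - pred_scaled)))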

Nevertheless, using xgboost for regression trees, I see that scaling the target has a substantial effect on the (in-sample) error of the predictions. What is the reason for this?

An example with the Boston housing dataset:

import numpy as np
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn.datasets import load_boston

boston = load_boston()
y = boston['target']
X = boston['data']

for scale in np.logspace(-6, 6, 7):
    # Fit on the rescaled target, then undo the scaling on the predictions
    # before measuring the error on the original scale.
    xgb_model = xgb.XGBRegressor().fit(X, y / scale)
    y_predicted = xgb_model.predict(X) * scale
    print('{} (scale={})'.format(mean_squared_error(y, y_predicted), scale))

2.3432734454908335 (scale=1e-06)
2.343273977065266 (scale=0.0001)
2.3432793874455315 (scale=0.01)
2.290595204136888 (scale=1.0)
2.528513393507719 (scale=100.0)
7.228978353091473 (scale=10000.0)
272.29640759874474 (scale=1000000.0)
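
(One guess on my part: some of xgboost's defaults are expressed in the target's units and are therefore not scale-invariant, for example base_score, the global bias that defaults to 0.5, and the regularization applied to leaf weights. A minimal probe, reusing X, y and the imports from above, that at least moves base_score along with the target:)

for scale in np.logspace(-6, 6, 7):
    # Hypothetical probe: set the global bias to the scaled target's mean
    # so that base_score no longer depends on the choice of scale.
    xgb_model = xgb.XGBRegressor(base_score=float(np.mean(y / scale))).fit(X, y / scale)
    y_predicted = xgb_model.predict(X) * scale
    print('{} (scale={})'.format(mean_squared_error(y, y_predicted), scale))

(If the errors then agree across scales, the remaining differences would presumably come from the unit-dependent regularization rather than from the tree structure itself.)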

When using 'reg:gamma' as the objective (instead of the default 'reg:linear'), the effect of scaling y is indeed large:

for scale in np.logspace(-6, 6, 7):
    # Same experiment, but with the gamma objective instead of squared error.
    xgb_model = xgb.XGBRegressor(objective='reg:gamma').fit(X, y / scale)
    y_predicted = xgb_model.predict(X) * scale
    print('{} (scale={})'.format(mean_squared_error(y, y_predicted), scale))

591.6509503519147 (scale=1e-06)
545.8298971540023 (scale=0.0001)
37.68688286293508 (scale=0.01)
4.039819858716935 (scale=1.0)
2.505477263590776 (scale=100.0)
198.94093800190453 (scale=10000.0)
592.1469169959003 (scale=1000000.0)
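
(A caveat on my own comparison: 'reg:gamma' fits a gamma likelihood with a log link and expects strictly positive targets, so MSE is not the scale on which that objective optimizes. Here is a hand-rolled sketch of the mean gamma deviance, the quantity 'reg:gamma' effectively minimizes up to constants, in case comparing on that scale is more informative; newer scikit-learn versions ship an equivalent sklearn.metrics.mean_gamma_deviance:)

def mean_gamma_deviance(y_true, y_pred):
    # Mean gamma deviance: 2 * mean((y - mu) / mu - log(y / mu)).
    # Both y_true and y_pred must be strictly positive.
    return 2.0 * np.mean((y_true - y_pred) / y_pred - np.log(y_true / y_pred))

# e.g. replace mean_squared_error(y, y_predicted) with
# mean_gamma_deviance(y, y_predicted) in the loop above.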

0 Answers:
