So, in GBM, each tree predicts the "pseudo-residuals" of the trees that came before it [1].
I'm not sure exactly how these "pseudo-residuals" work, but I'm wondering how they play out when you combine a binary response with a low response rate, as in the example below.
There I calculate the residuals as Actual - Probability, and since the response is binary,
I end up with a highly bimodal distribution of residuals that looks almost identical to the response itself.
Lowering the response rate exacerbates the bimodality: the predicted probabilities move toward zero, so a non-responder's residual (0 - p) lands just below 0 and a responder's residual (1 - p) just below 1, and the two clumps cluster even more tightly around 0 and 1.
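For context, here's my mental model of where those pseudo-residuals come from: for binary log-loss, the pseudo-residual is the negative gradient of the loss with respect to the current raw prediction, which works out to y - p. A minimal sketch of one boosting round (boosting_round and its arguments are names I made up, and this skips the per-leaf line search a real implementation like Friedman's would do):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosting_round(X, y, raw_score, learning_rate=0.1):
    # Current probabilities from the additive model's raw (log-odds) score
    p = 1.0 / (1.0 + np.exp(-raw_score))
    # Pseudo-residuals for binary log-loss: negative gradient = y - p
    pseudo_residuals = y - p
    # Each round fits a *regression* tree to the pseudo-residuals
    tree = DecisionTreeRegressor(max_depth=4)
    tree.fit(X, pseudo_residuals)
    # Shrunken additive update, in log-odds space
    return raw_score + learning_rate * tree.predict(X)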
So I have a few questions here:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
train_percent = 0.8
num_rows = 10000
positive_keep_rate = 0.1  # fraction of responders (y == 1) to keep; drives the class imbalance
# Generate data
X, y = make_classification(n_samples=num_rows, flip_y=0.55)  # flip_y assigns ~55% of labels at random, adding heavy noise
# Drop most of the responders (y == 1) to make the sample unbalanced:
# each responder survives with probability positive_keep_rate
remove = (np.random.random(len(y)) > positive_keep_rate) & (y == 1)
X, y = X[~remove], y[~remove]
print("Response Rate: " + str(sum(y) / float(len(y))))
# Get train/test samples (data is pre-shuffled)
train_rows = int(train_percent * len(X))
X_train, y_train = X[:train_rows], y[:train_rows]
X_test, y_test = X[train_rows:], y[train_rows:]
# Fit a simple decision tree
clf = DecisionTreeClassifier(max_depth=4)
clf.fit(X_train, y_train)
pred = clf.predict_proba(X_test)[:, 1]  # predicted probability of the positive class
# Calculate roc auc
roc_auc = roc_auc_score(y_test, pred)
print("ROC AUC: " + str(roc_auc))
# Plot residuals (Actual - Probability)
plt.style.use('ggplot')
plt.hist(y_test - pred)
plt.title('Residuals')
plt.show()
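For what it's worth, here's how I'd check whether the residuals stay bimodal as trees accumulate, reusing X_train/y_train/X_test/y_test from above with sklearn's GradientBoostingClassifier and its staged_predict_proba (the stage counts plotted are arbitrary picks):

from sklearn.ensemble import GradientBoostingClassifier

gbm = GradientBoostingClassifier(n_estimators=100, max_depth=4)
gbm.fit(X_train, y_train)
# staged_predict_proba yields test-set probabilities after each boosting stage
for stage, proba in enumerate(gbm.staged_predict_proba(X_test), start=1):
    if stage in (1, 10, 100):
        plt.hist(y_test - proba[:, 1], alpha=0.5, label='after %d trees' % stage)
plt.legend()
plt.title('Residuals (Actual - Probability) by stage')
plt.show()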