Deviance loss scores on the training data do not match clf.train_score_

Asked: 2018-02-20 16:05:44

Tags: python scikit-learn classification loss model-fitting

TL;DR: I am trying to understand what the GradientBoostingClassifier attribute train_score_ means, and specifically why it does not match my attempt below to compute it directly:

my_train_scores = [clf.loss_(y_train, y_pred) for y_pred in clf.staged_predict(X_train)]

More details: I am interested in the loss scores on the test and training data at the classifier's different fitting stages. For the test data I can compute the loss scores using staged_predict and loss_:

test_scores = [clf.loss_(y_test, y_pred) for y_pred in clf.staged_predict(X_test)]

That part is fine. My problem is the training loss scores. The documentation suggests using clf.train_score_:

    The i-th score train_score_[i] is the deviance (= loss) of the model at iteration i on the in-bag sample. If subsample == 1 this is the deviance on the training data.
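As a side note on that quoted passage (an illustration added here, not part of the original question): the meaning of train_score_ depends on subsample. A small sketch, using hypothetical variable names:

```python
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_hastie_10_2(random_state=0)

# With the default subsample=1, every stage trains on the full set,
# so train_score_[i] is the loss on the whole training data.
clf = GradientBoostingClassifier(n_estimators=5, random_state=0).fit(X, y)
print(clf.train_score_)  # one loss value per boosting stage

# With subsample < 1, train_score_[i] is instead the loss on the
# in-bag sample drawn for stage i.
clf_sub = GradientBoostingClassifier(n_estimators=5, subsample=0.5,
                                     random_state=0).fit(X, y)
print(clf_sub.train_score_)
```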

But these clf.train_score_ values do not match my attempt above to compute them directly as my_train_scores. What am I missing?

The code I used:

from sklearn.model_selection import train_test_split
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_hastie_10_2()
X_train, X_test, y_train, y_test = train_test_split(X, y)
clf = GradientBoostingClassifier(n_estimators=5, loss='deviance')
clf.fit(X_train, y_train)

test_scores = [clf.loss_(y_test, y_pred) for y_pred in clf.staged_predict(X_test)]
print(test_scores)
print(clf.train_score_)
my_train_scores = [clf.loss_(y_train, y_pred) for y_pred in clf.staged_predict(X_train)]
print(my_train_scores, '<= NOT the same values as in the previous line. Why?')

which produces, for example, this output...

[0.71319004170311229, 0.74985670836977902, 0.79319004170311214, 0.55385670836977885, 0.32652337503644546]
[ 1.369166    1.35366377  1.33780865  1.32352935  1.30866325]
[0.65541226392533436, 0.67430115281422309, 0.70807893059200089, 0.51096781948088987, 0.3078567083697788] <= NOT the same values as in the previous line. Why?

...where the last two lines do not match.

1 answer:

Answer 0 (score: 0):

The attribute self.train_score_ can be recreated as follows:

import numpy as np
import matplotlib.pyplot as plt

# Evaluate the loss on the raw scores from staged_decision_function,
# not on the class labels from staged_predict.
test_dev = []
for i, pred in enumerate(clf.staged_decision_function(X_test)):
    test_dev.append(clf.loss_(y_test, pred))

ax = plt.gca()
ax.plot(np.arange(clf.n_estimators) + 1, test_dev, color='#d7191c',
        label='Test', linewidth=2, alpha=0.7)
ax.plot(np.arange(clf.n_estimators) + 1, clf.train_score_, color='#2c7bb6',
        label='Train', linewidth=2, alpha=0.7, linestyle='--')
ax.set_xlabel('n_estimators')
plt.legend()
plt.show()

See the result below. Note that the curves lie on top of each other because here the training and test data are the same data.
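A version note (an addition, not part of the original answer): the clf.loss_ callable used above was later deprecated and removed from scikit-learn, so on recent releases an equivalent curve can be traced with sklearn.metrics.log_loss on staged_predict_proba; the binomial deviance is this log loss up to a constant factor of 2. A sketch:

```python
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = make_hastie_10_2(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GradientBoostingClassifier(n_estimators=5, random_state=0)
clf.fit(X_train, y_train)

# One log-loss value per boosting stage, computed from predicted
# probabilities instead of the removed clf.loss_ callable.
test_curve = [log_loss(y_test, proba)
              for proba in clf.staged_predict_proba(X_test)]
print(test_curve)
```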

[Image: per-stage test and train deviance curves, overlapping]