TL;DR: I am trying to understand the meaning of the train_score_ attribute of GradientBoostingClassifier, and specifically why it does not match my following attempt to compute it directly:
my_train_scores = [clf.loss_(y_train, y_pred) for y_pred in clf.staged_predict(X_train)]
More details: I am interested in the loss scores on the test and training data at the different stages of fitting the classifier. I can use staged_predict and loss_ to compute the loss scores for the test data:
test_scores = [clf.loss_(y_test, y_pred) for y_pred in clf.staged_predict(X_test)]
That works fine for me. My problem is with the training loss scores. The documentation suggests using clf.train_score_:

The i-th score train_score_[i] is the deviance (= loss) of the model at iteration i on the in-bag sample. If subsample == 1 this is the deviance on the training data.
But these clf.train_score_ values do not match my attempt to compute them directly in my_train_scores above. What am I missing here?
The code I used:
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_hastie_10_2()
X_train, X_test, y_train, y_test = train_test_split(X, y)

clf = GradientBoostingClassifier(n_estimators=5, loss='deviance')
clf.fit(X_train, y_train)

test_scores = [clf.loss_(y_test, y_pred) for y_pred in clf.staged_predict(X_test)]
print(test_scores)
print(clf.train_score_)

my_train_scores = [clf.loss_(y_train, y_pred) for y_pred in clf.staged_predict(X_train)]
print(my_train_scores, '<= NOT the same values as in the previous line. Why?')
producing, for example, this output:
[0.71319004170311229, 0.74985670836977902, 0.79319004170311214, 0.55385670836977885, 0.32652337503644546]
[ 1.369166 1.35366377 1.33780865 1.32352935 1.30866325]
[0.65541226392533436, 0.67430115281422309, 0.70807893059200089, 0.51096781948088987, 0.3078567083697788] <= NOT the same values as in the previous line. Why?
...where the last two lines do not match.
Answer 0 (score: 0):
You can recreate the attribute self.train_score_ as follows. The key point is that loss_ is evaluated on the raw scores from staged_decision_function, not on the class labels returned by staged_predict:
import numpy as np
import matplotlib.pyplot as plt

# Evaluate the loss on the raw decision-function values at each stage,
# which is how train_score_ is computed internally.
test_dev = []
for i, pred in enumerate(clf.staged_decision_function(X_test)):
    test_dev.append(clf.loss_(y_test, pred))

ax = plt.gca()
ax.plot(np.arange(clf.n_estimators) + 1, test_dev, color='#d7191c', label='Test', linewidth=2, alpha=0.7)
ax.plot(np.arange(clf.n_estimators) + 1, clf.train_score_, color='#2c7bb6', label='Train', linewidth=2, alpha=0.7, linestyle='--')
ax.set_xlabel('n_estimators')
plt.legend()
plt.show()
See the result below. Note that the curves lie on top of each other, because here the training and test data are the same data.
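To check this diagnosis numerically, here is a minimal sketch (not part of the original answer) that compares a per-stage log loss computed from staged_decision_function against clf.train_score_ on the training data. It assumes a recent scikit-learn, where clf.loss_ and loss='deviance' are no longer available, so it uses sklearn.metrics.log_loss on sigmoid-transformed raw scores instead. Depending on the scikit-learn version, train_score_ stores either the binomial deviance (twice the mean log loss) or the mean log loss itself, so the two sequences should agree up to a constant factor:

```python
import numpy as np
from scipy.special import expit
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = make_hastie_10_2(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GradientBoostingClassifier(n_estimators=5)
clf.fit(X_train, y_train)

# Per-stage training loss from the RAW scores (not class labels):
# sigmoid of the decision function gives the probability of the positive class.
stage_loss = [
    log_loss(y_train, expit(raw.ravel()), labels=[-1, 1])
    for raw in clf.staged_decision_function(X_train)
]

# Ratio should be constant across stages (~2.0 for old deviance,
# ~1.0 where train_score_ is the mean log loss).
ratio = np.asarray(clf.train_score_) / np.asarray(stage_loss)
print(stage_loss)
print(clf.train_score_)
print(ratio)
```

With subsample left at its default of 1, train_score_ is computed on the full training data, which is why the comparison above is meaningful; with subsample < 1 it is computed only on the in-bag samples of each stage.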