Question

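I want to use sklearn.ensemble.GradientBoostingClassifier on an imbalanced classification problem. I intend to optimize the area under the receiver operating characteristic curve (ROC AUC). To do that, I would like to reweight my classes so that the minority class matters more to the classifier.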

This is normally done by setting class_weight="balanced" (for example in RandomForestClassifier), but GradientBoostingClassifier has no such parameter.

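The documentation says:

The "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))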

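If y_train is a dataframe holding my target, with elements in {0, 1}, then the documentation implies that the following should reproduce the same behavior as class_weight="balanced":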

import numpy as np
from sklearn import ensemble

sample_weight = y_train.shape[0] / (2 * np.bincount(y_train))
clf = ensemble.GradientBoostingClassifier(**params)
clf.fit(X_train, y_train, sample_weight=sample_weight[y_train.values])

Is this correct, or am I missing something?

Answer 1

I suggest using the class_weight.compute_sample_weight utility in scikit-learn. For example:

from sklearn.utils.class_weight import compute_sample_weight
y = [1, 1, 1, 1, 0, 0, 1]
compute_sample_weight(class_weight='balanced', y=y)

Output:

array([ 0.7 ,  0.7 ,  0.7 ,  0.7 ,  1.75,  1.75,  0.7 ])
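As a quick check, the manual formula from the question can be compared with this utility on the same toy labels. This is only a sketch: the names class_weights, manual and balanced are illustrative, and the np.bincount approach assumes the labels are the integers 0 and 1 with both classes present.

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

y = np.array([1, 1, 1, 1, 0, 0, 1])

# Formula from the question: n_samples / (n_classes * np.bincount(y)),
# which gives one weight per class, indexed by the class label.
class_weights = y.shape[0] / (2 * np.bincount(y))

# Look up the weight of each sample's class.
manual = class_weights[y]

# Utility suggested in this answer.
balanced = compute_sample_weight(class_weight="balanced", y=y)

print(np.allclose(manual, balanced))  # True
```

The two agree here because class 0 has 2 samples (7 / (2 * 2) = 1.75) and class 1 has 5 samples (7 / (2 * 5) = 0.7). The utility has the advantage of not relying on np.bincount, which only accepts non-negative integer labels.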

You can pass the array returned by compute_sample_weight as the sample_weight keyword argument of fit.
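A minimal sketch of how this might look with GradientBoostingClassifier. The synthetic data from make_classification and the hyperparameters are assumptions made purely for illustration; they stand in for the question's X_train, y_train and params.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic imbalanced data (roughly 90% class 0), for illustration only.
X_train, y_train = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=0
)

# One weight per training sample, inversely proportional to class frequency.
weights = compute_sample_weight(class_weight="balanced", y=y_train)

# Hyperparameters are placeholders, not a recommendation.
clf = GradientBoostingClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train, sample_weight=weights)
```

With balanced weights, the total weight of each class is the same (n_samples / n_classes), so the minority class contributes as much to the loss as the majority class.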
