Question

I am using XGBClassifier to model an imbalanced multiclass target. I have a few questions:

First, I would like to know where I should pass the weight parameter: when instantiating the classifier, or in the fit step of the pipeline?

Second, how do I calculate the weights? I assume the array should sum to 1.

Third, is there a particular order in which the weight array maps to the different class labels?

Thanks, everyone.

Answer 1

First question:

Where should I use the weight parameter?

Pass sample_weight to XGBClassifier.fit():

import xgboost as xgb

xgb_clf = xgb.XGBClassifier()
xgb_clf.fit(X, y, sample_weight=sample_weight)

When using a Pipeline, prefix the keyword with the step name and a double underscore:

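from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('my_xgb_clf', xgb.XGBClassifier()),
])
# 'my_xgb_clf__sample_weight' routes sample_weight to that step's fit()
pipe.fit(X, y, my_xgb_clf__sample_weight=sample_weight)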

By the way, some sklearn APIs do not support the sample_weight kwarg, for example learning_curve.

So I simply do this:

import functools
xgb_clf.fit = functools.partial(xgb_clf.fit, sample_weight=sample_weight)

Note: you will need to patch fit() again after a grid search, because GridSearchCV.best_estimator_ is not the original estimator object.
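To make the patching trick and its caveat concrete, here is a minimal sketch that uses a made-up ToyEstimator class in place of XGBClassifier (the class, data, and weights are assumptions for illustration only). It shows that functools.partial binds sample_weight on one specific instance, and that a different instance of the same class does not inherit the patch.

```python
import functools


class ToyEstimator:
    """Hypothetical stand-in for an estimator such as XGBClassifier."""

    def fit(self, X, y, sample_weight=None):
        # Record what fit() received so the effect of the patch is visible.
        self.seen_weight_ = sample_weight
        return self


X = [[0.0], [1.0], [2.0]]
y = [0, 0, 1]
sample_weight = [0.75, 0.75, 1.5]

est = ToyEstimator()
# Bind sample_weight so callers that only pass (X, y) still supply it.
est.fit = functools.partial(est.fit, sample_weight=sample_weight)
est.fit(X, y)
patched_seen = est.seen_weight_

# A new instance of the same class does not carry the instance-level patch,
# which is why the patch has to be reapplied to a different estimator object.
fresh = ToyEstimator()
fresh.fit(X, y)
fresh_seen = fresh.seen_weight_
```

One thing to keep in mind with this approach: the bound weights are fixed to the full training set, so any caller that fits on a subset of the rows would receive an array of the wrong length.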

Second question:

How do I calculate the weights? I assume the array should sum to 1.

from sklearn.utils import compute_sample_weight
sample_weight = compute_sample_weight('balanced', y_train)

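This mimics class_weight='balanced' in sklearn.

Notes:

The array does not sum to 1. You can normalize it, but I think the scores would come out differently.
This is not equivalent to class_weight='balanced_subsample'. I could not find a way to emulate that.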
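For intuition about what the 'balanced' heuristic computes, the sketch below reimplements the formula given in the sklearn documentation, n_samples / (n_classes * class_count), in plain Python. The function name balanced_sample_weight and the toy labels are made up for this example; in real code compute_sample_weight does the work.

```python
from collections import Counter


def balanced_sample_weight(y):
    """Return one weight per sample: n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return [n_samples / (n_classes * counts[label]) for label in y]


# Toy labels (hypothetical): class 0 is common, class 2 is rare.
y_train = [0, 0, 0, 0, 1, 1, 2]
weights = balanced_sample_weight(y_train)

# Every class ends up with the same total weight, and the weights sum to
# n_samples rather than to 1.
total = sum(weights)

# Optional normalization so that the weights sum to 1.
normalized = [w / total for w in weights]
```

With these labels the rare class gets the largest per-sample weight, and dividing by the total is all that normalization involves.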

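Third question:

Is there any order...

Sorry, I don't understand what you mean...

Maybe you want the order in xgb_clf.classes_? You can access it after calling xgb_clf.fit. Or just use np.unique(y_train).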
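If the underlying concern is how weights line up with labels, note that sample_weight holds one weight per sample, in the same row order as y, so it does not depend on any class ordering. A rough sketch, assuming you start from per-class weights (the labels and the class_weight dict below are hypothetical), keyed by label so that order never matters:

```python
# Hypothetical labels and per-class weights, chosen only for illustration.
y_train = ["cat", "dog", "cat", "bird", "cat", "dog"]
class_weight = {"bird": 2.0, "cat": 2 / 3, "dog": 1.0}

# Sorted unique labels; np.unique(y_train) would return them in the same order.
classes = sorted(set(y_train))

# Expand the per-class weights into one weight per sample, aligned with y_train.
sample_weight = [class_weight[label] for label in y_train]
```

The resulting list can be passed as sample_weight to fit(), since its i-th entry belongs to the i-th row of the training data.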
