When I use a linear SVM for a classification problem in scikit-learn, I can apply a custom weight to each training sample, like this:
from sklearn.linear_model import SGDClassifier
X = [[0.0, 0.0], [1.0, 1.0]]
y = [0, 1]
sample_weight = [1.0, 0.5]
clf = SGDClassifier(loss="hinge")
clf.fit(X, y, sample_weight=sample_weight)
Now, when I have a multi-label classification task, I need to transform the labels, and the SGDClassifier has to be wrapped in a meta-estimator such as OneVsRestClassifier:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import SGDClassifier
X = [[0.0, 0.0], [1.0, 1.0], [1.0, 0.0]]
y = [[0], [1], [0, 1]]
y_mlb = MultiLabelBinarizer().fit_transform(y)
sample_weight = [1.0, 0.5, 0.8]
clf = OneVsRestClassifier(SGDClassifier(loss="hinge"))
clf.fit(X, y_mlb) # unable to pass `sample_weight`
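However, the fit method of OneVsRestClassifier does not let me pass any arguments other than X and y, so I cannot apply sample weights the way I did before. How can I apply my own sample weights in this case?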
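Answer 0 (score: 1)
Instead, try subclassing OneVsRestClassifier and overriding its fit method so that sample_weight can be passed through. You need to change both fit() and the _fit_binary() helper it uses.
Try editing the source from here as follows: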
import warnings
import numpy as np
# sklearn.externals.joblib is gone from recent scikit-learn releases; import joblib directly
from joblib import Parallel, delayed
from sklearn.base import clone
from sklearn.multiclass import _ConstantPredictor, OneVsRestClassifier
from sklearn.preprocessing import LabelBinarizer, MultiLabelBinarizer
from sklearn.linear_model import SGDClassifier
def _fit_binary_new(estimator, X, y, sample_weight, classes=None):
    unique_y = np.unique(y)
    if len(unique_y) == 1:
        if classes is not None:
            if y[0] == -1:
                c = 0
            else:
                c = y[0]
            warnings.warn("Label %s is present in all training examples." %
                          str(classes[c]))
        estimator = _ConstantPredictor().fit(X, unique_y)
    else:
        estimator = clone(estimator)
        # Only this changed
        estimator.fit(X, y, sample_weight=sample_weight)
    return estimator
class OneVsRestClassifierNew(OneVsRestClassifier):
    def fit(self, X, y, sample_weight=None):
        self.label_binarizer_ = LabelBinarizer(sparse_output=True)
        Y = self.label_binarizer_.fit_transform(y)
        Y = Y.tocsc()
        self.classes_ = self.label_binarizer_.classes_
        columns = (col.toarray().ravel() for col in Y.T)
        self.estimators_ = Parallel(n_jobs=self.n_jobs)(delayed(_fit_binary_new)(
            self.estimator, X, column, sample_weight, classes=[
                "not %s" % self.label_binarizer_.classes_[i],
                self.label_binarizer_.classes_[i]])
            for i, column in enumerate(columns))
        return self
X = [[0.0, 0.0], [1.0, 1.0], [1.0, 0.0]]
y = [[0], [1], [0, 1]]
y_mlb = MultiLabelBinarizer().fit_transform(y)
sample_weight = [1.0, 0.5, 0.8]
clf = OneVsRestClassifierNew(SGDClassifier(loss="hinge"))
clf.fit(X, y_mlb, sample_weight=sample_weight)
clf.predict(X)
# Output: array([[1, 0],
#                [0, 1],
#                [1, 1]])
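The subclass relies on the private `_ConstantPredictor` helper, which may change between scikit-learn versions. As an alternative, here is a minimal sketch that skips OneVsRestClassifier and fits one binary SGDClassifier per label column by hand, passing the same weights to each. This is a different technique from the subclass, not a drop-in replacement: it assumes every label column contains both classes, and the `random_state=0` and variable names are my own choices for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import SGDClassifier

X = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0]])
# Binary indicator matrix, as MultiLabelBinarizer produces for [[0], [1], [0, 1]]
Y = np.array([[1, 0], [0, 1], [1, 1]])
sample_weight = np.array([1.0, 0.5, 0.8])

# random_state is fixed only to make the sketch repeatable (an assumption)
base = SGDClassifier(loss="hinge", random_state=0)

# One independent binary classifier per label column, each fitted with the same weights
estimators = [
    clone(base).fit(X, Y[:, j], sample_weight=sample_weight)
    for j in range(Y.shape[1])
]

# Stack the per-label predictions back into an indicator matrix
Y_pred = np.column_stack([est.predict(X) for est in estimators])
print(Y_pred.shape)  # (3, 2): one row per sample, one column per label
```

This loses the conveniences of the meta-estimator (parallel fitting, handling of constant labels), so it is mainly useful when only a handful of labels are involved.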
Note: OneVsRestClassifierNew only works with classifiers whose fit() method accepts sample_weight, because _fit_binary_new() does not check whether that parameter exists.
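The missing check could be sketched roughly as below, using only the standard library's `inspect` module to look at the signature of fit(). The helper names `accepts_sample_weight` and `fit_checked` and the two toy classes are hypothetical and exist only for illustration. scikit-learn also ships `sklearn.utils.validation.has_fit_parameter`, which serves the same purpose. Note that a fit() taking only `**kwargs` would not be detected by this signature test.

```python
import inspect


def accepts_sample_weight(estimator):
    """Return True if estimator.fit declares a parameter named sample_weight."""
    return "sample_weight" in inspect.signature(estimator.fit).parameters


def fit_checked(estimator, X, y, sample_weight=None):
    """Fit estimator, forwarding sample_weight only when fit() supports it."""
    if sample_weight is None:
        return estimator.fit(X, y)
    if not accepts_sample_weight(estimator):
        raise ValueError(
            "%s.fit() does not accept sample_weight" % type(estimator).__name__
        )
    return estimator.fit(X, y, sample_weight=sample_weight)


# Hypothetical stand-ins for real estimators, used only to exercise the check
class WeightedToy:
    def fit(self, X, y, sample_weight=None):
        self.weights_ = sample_weight
        return self


class UnweightedToy:
    def fit(self, X, y):
        return self


X = [[0.0, 0.0], [1.0, 1.0]]
y = [0, 1]

fitted = fit_checked(WeightedToy(), X, y, sample_weight=[1.0, 0.5])
print(fitted.weights_)

try:
    fit_checked(UnweightedToy(), X, y, sample_weight=[1.0, 0.5])
except ValueError as exc:
    print(exc)
```

Inside _fit_binary_new(), the call to estimator.fit(...) could be replaced by something like fit_checked(estimator, X, y, sample_weight), so that unsupported estimators fail with a clear message instead of a TypeError from deep inside the parallel loop.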