Question

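I'm trying to figure out what exactly the loss function formula is, and how I can calculate it by hand when class_weight='auto', for svm.SVC, svm.LinearSVC and linear_model.LogisticRegression.

For balanced data, say you have a trained classifier: clf_c. The logistic loss should be (am I correct?):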

import numpy as np

def logistic_loss(x, y, w, b, b0):
    '''
    x: nxp data matrix where n is number of data points and p is number of features.
    y: nx1 vector of true labels (-1 or 1).
    w: nx1 vector of weights (vector of 1./n for balanced data).
    b: px1 vector of feature weights.
    b0: intercept.
    '''
    s = y
    if 0 in np.unique(y):
        # Labels were given as (0, 1); map them to (-1, 1).
        s = 2. * y - 1
    l = np.dot(w, np.log(1 + np.exp(-s * (np.dot(x, np.squeeze(b)) + b0))))
    return l

I realized that LogisticRegression has predict_log_proba(), which gives you exactly that when the data is balanced:

b, b0 = clf_c.coef_, clf_c.intercept_
w = np.ones(len(y))/len(y)
np.isclose(-clf_c.predict_log_proba(x)[np.arange(len(x)), np.floor((y+1)/2).astype(np.int8)].mean(), logistic_loss(x,y,w,b,b0))

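Note that np.floor((y+1)/2).astype(np.int8) simply maps y = (-1, 1) to y = (0, 1).

However, this does not work when the data is imbalanced.

More importantly, you would expect the classifier (here, LogisticRegression) to perform similarly, in terms of loss function value, when the data is balanced with class_weight=None and when the data is imbalanced with class_weight='auto'. I need a way to calculate the loss function (without the regularization term) for both scenarios and compare them.

In short, what exactly does class_weight = 'auto' mean? Does it mean class_weight = {-1 : (y==1).sum()/(y==-1).sum() , 1 : 1.} or rather class_weight = {-1 : 1./(y==-1).sum() , 1 : 1./(y==1).sum()}?

Any help is much appreciated. I tried going through the source code, but I am not a programmer and I am stuck. Thanks a lot in advance.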

Answer 1

`class_weight` heuristics

I am a bit puzzled by your first proposition for the class_weight='auto' heuristic, as:

class_weight = {-1 : (y == 1).sum() / (y == -1).sum(), 
                1 : 1.}

is the same as your second proposition if we normalize it so that the weights sum to 1.
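A quick numeric check of that claim, as a minimal sketch in plain Python. The label vector `y` and the helper `normalize` are made up for illustration and are not part of scikit-learn:

```python
def normalize(weights):
    """Rescale a {label: weight} dict so that the weights sum to 1."""
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}

# Hypothetical label vector: 8 negatives, 2 positives.
y = [-1] * 8 + [1] * 2
n_neg = y.count(-1)
n_pos = y.count(1)

first = {-1: n_pos / n_neg, 1: 1.0}          # first proposition
second = {-1: 1.0 / n_neg, 1: 1.0 / n_pos}   # second proposition

first_n = normalize(first)
second_n = normalize(second)
print(first_n)
print(second_n)
```

Both dictionaries come out as roughly {-1: 0.2, 1: 0.8}, so the two propositions only differ by a global scale factor.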

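Anyway, to understand what class_weight="auto" does, see this question: what is the difference between class weight = none and auto in svm scikit learn

I am copying it here for later comparison:

This means that each class you have (in classes) gets a weight equal to 1 divided by the number of times that class appears in your data (y), so classes that appear more often will get lower weights. This is then further divided by the mean of all the inverse class frequencies.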
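Read literally, that description can be sketched as follows. This is my own reconstruction from the quoted text, not the scikit-learn source, and `auto_class_weight` is a hypothetical helper name:

```python
from collections import Counter

def auto_class_weight(y):
    """Sketch of the deprecated 'auto' heuristic as described above.

    Each class gets 1 / count, then every weight is divided by the mean
    of those inverse frequencies.
    """
    counts = Counter(y)
    inverse_freq = {label: 1.0 / n for label, n in counts.items()}
    mean_inverse = sum(inverse_freq.values()) / len(inverse_freq)
    return {label: w / mean_inverse for label, w in inverse_freq.items()}

# Hypothetical labels with class counts 57, 396 and 547.
y = [0] * 57 + [1] * 396 + [2] * 547
weights = auto_class_weight(y)
print(weights)
```

With these class counts (the same ones as in the output further down) the weights come out at about 2.40, 0.35 and 0.25, which agrees with the 'auto' weights printed there. Note that they sum to the number of classes.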

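Note how this is not completely obvious ;).

This heuristic is deprecated and will be removed in 0.18. It will be replaced by another heuristic, class_weight='balanced'.

The 'balanced' heuristic weighs classes proportionally to the inverse of their frequency.

From the docs:

The "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data: n_samples / (n_classes * np.bincount(y)).

np.bincount(y) is an array in which element i is the count of samples of class i.

Here's a bit of code to compare the two: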

import numpy as np
from sklearn.datasets import make_classification
from sklearn.utils import compute_class_weight

n_classes = 3
n_samples = 1000

X, y = make_classification(n_samples=n_samples, n_features=20, n_informative=10, 
    n_classes=n_classes, weights=[0.05, 0.4, 0.55])

print("Count of samples per class: ", np.bincount(y))
balanced_weights = n_samples /(n_classes * np.bincount(y))
# Equivalent to the following, using version 0.17+:
# compute_class_weight("balanced", [0, 1, 2], y)

print("Balanced weights: ", balanced_weights)
print("'auto' weights: ", compute_class_weight("auto", [0, 1, 2], y))

Output:

Count of samples per class:  [ 57 396 547]
Balanced weights:  [ 5.84795322  0.84175084  0.60938452]
'auto' weights:  [ 2.40356854  0.3459682   0.25046327]
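Both heuristics are proportional to 1 / count, so the two weight vectors should differ only by a constant factor. A small sketch to check this, using the class counts from the output above and plain Python instead of NumPy:

```python
counts = [57, 396, 547]          # class counts from the output above
n_samples = sum(counts)
n_classes = len(counts)

# 'balanced': n_samples / (n_classes * count)
balanced = [n_samples / (n_classes * c) for c in counts]

# 'auto' as described earlier: (1 / count) / mean(1 / count)
inverse = [1.0 / c for c in counts]
mean_inverse = sum(inverse) / n_classes
auto = [w / mean_inverse for w in inverse]

ratios = [b / a for b, a in zip(balanced, auto)]
print(ratios)
```

The three ratios are identical (about 2.43 here). For estimators where the class weight multiplies C, switching from 'auto' to 'balanced' therefore acts like rescaling C by that factor.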

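The loss function

Now the real question is: how are these weights used to train the classifier?

Unfortunately, I don't have a thorough answer here.

For SVC and LinearSVC the docstring is pretty clear:

Set the parameter C of class i to class_weight[i]*C for SVC.

So a high weight means less regularization for that class and a higher incentive for the SVM to classify it correctly.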
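To make the per-class C concrete, here is a minimal sketch of a linear SVM primal objective with class-dependent C. It assumes labels -1 and 1, a linear decision function and the plain hinge loss (LinearSVC uses the squared hinge by default). The function name is hypothetical and this is not the libsvm or liblinear code:

```python
def weighted_hinge_objective(X, y, w, b, C, class_weight):
    """Primal objective of a linear SVM with a per-class C (sketch).

    0.5 * ||w||^2 + sum_i (class_weight[y_i] * C) * max(0, 1 - y_i * (w . x_i + b))
    Labels in y are assumed to be -1 or 1.
    """
    regularizer = 0.5 * sum(wj * wj for wj in w)
    data_term = 0.0
    for xi, yi in zip(X, y):
        score = sum(wj * xj for wj, xj in zip(w, xi)) + b
        hinge = max(0.0, 1.0 - yi * score)
        data_term += class_weight[yi] * C * hinge
    return regularizer + data_term

# Toy 1-D data (hypothetical): only the second point violates the margin.
X = [[2.0], [0.5], [-1.0]]
y = [1, 1, -1]
w, b, C = [1.0], 0.0, 1.0

unweighted = weighted_hinge_objective(X, y, w, b, C, {-1: 1.0, 1: 1.0})
upweighted = weighted_hinge_objective(X, y, w, b, C, {-1: 1.0, 1: 4.0})
print(unweighted, upweighted)
```

Raising the weight of class 1 makes its margin violation four times as costly, while the regularization term stays the same.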

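I do not know how they work with logistic regression. I'll try to look into it, but most of the code is in liblinear or libsvm, and I'm not too familiar with those.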
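For comparing loss values by hand, one plausible reading is that the class weight multiplies each sample's logistic loss term, by analogy with the per-class C of the SVMs. I have not confirmed this from the liblinear source, so treat the following as a sketch under that assumption. The function names are hypothetical, and dividing by the total weight is my own choice so that the balanced and imbalanced cases are on the same scale:

```python
import math

def log1p_exp_neg(z):
    """Numerically stable log(1 + exp(-z))."""
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

def weighted_logistic_loss(X, y, coef, intercept, class_weight=None):
    """Weighted average of per-sample logistic losses (no regularization term).

    y holds labels -1 or 1. class_weight maps label -> weight; None means
    every sample counts equally, which gives the plain mean.
    """
    total_loss = 0.0
    total_weight = 0.0
    for xi, yi in zip(X, y):
        score = sum(c * xj for c, xj in zip(coef, xi)) + intercept
        weight = 1.0 if class_weight is None else class_weight[yi]
        total_loss += weight * log1p_exp_neg(yi * score)
        total_weight += weight
    return total_loss / total_weight

# Toy data (hypothetical): one correctly and one wrongly classified point.
X = [[1.0], [1.0]]
y = [1, -1]
plain = weighted_logistic_loss(X, y, [1.0], 0.0)
weighted = weighted_logistic_loss(X, y, [1.0], 0.0, {-1: 1.0, 1: 3.0})
print(plain, weighted)
```

With class_weight=None this reduces to the plain mean, which corresponds to the w = 1/n vector in the question. Here the correctly classified class 1 is up-weighted, so the weighted loss is lower than the plain one.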

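Note, however, that the weights in class_weight do not directly influence methods such as predict_proba. They change its output only because the classifier optimizes a different loss function. Not sure if this is clear, so here's a snippet to explain what I mean (you need to run the first one for the imports and variable definitions):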

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(class_weight="auto")
lr.fit(X, y)
# We get some probabilities...
print(lr.predict_proba(X))

new_lr = LogisticRegression(class_weight={0: 100, 1: 1, 2: 1})
new_lr.fit(X, y)
# We get different probabilities...
print(new_lr.predict_proba(X))

# Let's cheat a bit and hand-modify our new classifier.
new_lr.intercept_ = lr.intercept_.copy()
new_lr.coef_ = lr.coef_.copy()

# Now we get the SAME probabilities.
np.testing.assert_array_equal(new_lr.predict_proba(X), lr.predict_proba(X))
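The reason the probabilities match is that prediction only looks at the fitted parameters. A minimal sketch for the binary case, where the probability of the positive class is the sigmoid of the decision function; the function name is hypothetical, and the three-class model above normalizes its scores differently:

```python
import math

def predict_proba_binary(x, coef, intercept):
    """Return [P(negative), P(positive)] for one sample.

    Only coef and intercept are used; class weights never appear here.
    """
    z = sum(c * xj for c, xj in zip(coef, x)) + intercept
    # Evaluate the sigmoid in a form that cannot overflow.
    if z >= 0:
        p_pos = 1.0 / (1.0 + math.exp(-z))
    else:
        p_pos = math.exp(z) / (1.0 + math.exp(z))
    return [1.0 - p_pos, p_pos]

probs_zero = predict_proba_binary([0.0], [1.0], 0.0)
probs_far = predict_proba_binary([-1000.0], [1.0], 0.0)
print(probs_zero, probs_far)
```

Two classifiers trained with different class weights end up with different coef_ and intercept_, and that is the only way the weights reach the predicted probabilities.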

Hope this helps.
