Question

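I am doing multi-label classification, where I am trying to predict the correct tags for questions:

(X = questions, y = the list of tags for each question in X).

Which decision_function_shape for sklearn.svm.SVC should be used with OneVsRestClassifier for multi-label classification?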

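From the docs we can see that decision_function_shape can take two values, 'ovo' and 'ovr':

decision_function_shape : 'ovo', 'ovr' or None, default=None

Whether to return a one-vs-rest ('ovr') decision function of shape (n_samples, n_classes) as all other classifiers, or the original one-vs-one ('ovo') decision function of libsvm which has shape (n_samples, n_classes * (n_classes - 1) / 2). The default of None will currently behave as 'ovo' for backward compatibility and raise a deprecation warning, but will change to 'ovr' in 0.19.

But I still don't understand what the difference is between: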

# First: decision_function_shape set to 'ovo'
estim = OneVsRestClassifier(SVC(kernel='linear', decision_function_shape='ovo'))

# Second: decision_function_shape set to 'ovr'
estim = OneVsRestClassifier(SVC(kernel='linear', decision_function_shape='ovr'))

Which decision_function_shape should be used for a multi-label classification problem?

Edit: A similar question has been asked, but it has no answer.

Answer 1

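I think the question of which one to use is best left to the situation, and it could easily be part of your GridSearch. Intuitively, though, I would expect no difference at all: you will be doing the same thing either way. Here is my reasoning:

OneVsRestClassifier is designed to model each class against all of the other classes independently, creating one classifier per class. The way I understand the process, OneVsRestClassifier takes one class and creates a binary label for whether each point belongs to that class or not. That labelling is then fed into whatever estimator you have chosen. I believe the confusion arises because SVC also lets you make this same choice, but in this setup the choice does not matter, because you are always feeding only two classes into the SVC.

Here is an example: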

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

data = load_iris()

X, y = data.data, data.target
estim1 = OneVsRestClassifier(SVC(kernel='linear', decision_function_shape='ovo'))
estim1.fit(X,y)

estim2 = OneVsRestClassifier(SVC(kernel='linear', decision_function_shape='ovr'))
estim2.fit(X,y)

print(estim1.coef_ == estim2.coef_)
array([[ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True]], dtype=bool)

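So you can see that the coefficients are equal for all three estimators built by the two models. Granted, this dataset has only 150 samples and 3 classes, so the results could conceivably differ on a more complex dataset, but it is a simple proof of concept.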
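As a further check of the reasoning above, here is a minimal sketch that compares the decision scores rather than the coefficients. The variable names are my own, and it assumes a scikit-learn version where load_iris accepts return_X_y=True. It also confirms that every inner SVC only ever sees two classes.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

scores = {}
for shape in ("ovo", "ovr"):
    clf = OneVsRestClassifier(SVC(kernel="linear", decision_function_shape=shape))
    clf.fit(X, y)
    # Each inner SVC was trained on a binary "this class vs. the rest" target.
    assert all(len(est.classes_) == 2 for est in clf.estimators_)
    scores[shape] = clf.decision_function(X)

# One column of scores per class, whichever shape was requested on the SVC.
print(scores["ovo"].shape)
print(np.allclose(scores["ovo"], scores["ovr"]))
```

Since the wrapper hands each SVC a binary problem, the decision_function_shape setting has nothing to reshape, which is why both runs should agree.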

Answer 2

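The shapes of the decision functions differ because 'ovo' returns one value for each pair of classes, whereas 'ovr' returns one value per class. Note that SVC always trains the pairwise (one-vs-one) classifiers internally; 'ovr' only aggregates their results into an array of shape (n_samples, n_classes).

The best example I could find is here on http://scikit-learn.org:

SVC and NuSVC implement the "one-against-one" approach (Knerr et al., 1990) for multi-class classification. If n_class is the number of classes, then n_class * (n_class - 1) / 2 classifiers are constructed and each one trains data from two classes. To provide a consistent interface with other classifiers, the decision_function_shape option allows to aggregate the results of the "one-against-one" classifiers to a decision function of shape (n_samples, n_classes)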

>>> from sklearn import svm
>>> X = [[0], [1], [2], [3]]
>>> Y = [0, 1, 2, 3]
>>> clf = svm.SVC(decision_function_shape='ovo')
>>> clf.fit(X, Y) 
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovo', degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
>>> dec = clf.decision_function([[1]])
>>> dec.shape[1] # 4 classes: 4*3/2 = 6
6
>>> clf.decision_function_shape = "ovr"
>>> dec = clf.decision_function([[1]])
>>> dec.shape[1] # 4 classes
4

What does this mean in simple terms?

To understand what n_class * (n_class - 1) / 2 means, generate the two-class combinations using itertools.combinations.

import itertools

def ovo_classifiers(classes):
    """Return the number of one-vs-one classifiers and the class pairs they compare."""
    n_class = len(classes)
    # Integer division, since this is a count of classifiers
    n = n_class * (n_class - 1) // 2
    combos = itertools.combinations(classes, 2)
    return (n, list(combos))

>>> ovo_classifiers(['a', 'b', 'c'])
(3, [('a', 'b'), ('a', 'c'), ('b', 'c')])
>>> ovo_classifiers(['a', 'b', 'c', 'd'])
(6, [('a', 'b'), ('a', 'c'), ('a', 'd'), ('b', 'c'), ('b', 'd'), ('c', 'd')])
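To tie the pair count back to SVC itself, here is a small sketch on the digits dataset (my choice for illustration, since it has 10 classes and therefore 45 pairs). As far as I understand, with the default settings the option only changes the shape of decision_function, so the predictions from both models should match.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)  # 10 classes

ovo = SVC(kernel="linear", decision_function_shape="ovo").fit(X, y)
ovr = SVC(kernel="linear", decision_function_shape="ovr").fit(X, y)

n_class = len(ovo.classes_)
n_pairs = n_class * (n_class - 1) // 2

# 'ovo': one column per pair of classes; 'ovr': one column per class.
print(ovo.decision_function(X).shape)
print(ovr.decision_function(X).shape)

# Same underlying pairwise model, so the predicted labels should agree.
same_predictions = np.array_equal(ovo.predict(X), ovr.predict(X))
print(same_predictions)
```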

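Which estimator should be used for multi-label classification?

In your situation, you have a question with multiple tags (like questions here on StackOverflow). If you know your tags (classes) in advance, I would suggest OneVsRestClassifier(LinearSVC()), but you could also try DecisionTreeClassifier or RandomForestClassifier (I think):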

import re

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.multiclass import OneVsRestClassifier

df = pd.DataFrame({
  'Tags': [['python', 'pandas'], ['c#', '.net'], ['ruby'],
           ['python'], ['c#'], ['sklearn', 'python']],
  'Questions': ['This is a post about python and pandas is great.',
           'This is a c# post and i hate .net',
           'What is ruby on rails?', 'who else loves python',
           'where to learn c#', 'sklearn is a python package for machine learning']},
                  columns=['Questions', 'Tags'])

X = df['Questions']
# Turn each list of tags into one row of a binary indicator matrix
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(df['Tags'].values)

# Tokenize on the known tags only; re.escape keeps the '.' in '.net' literal
tag_pattern = '|'.join(re.escape(tag) for tag in mlb.classes_)
pipeline = Pipeline([
  ('vect', CountVectorizer(token_pattern=tag_pattern)),
  ('linear_svc', OneVsRestClassifier(LinearSVC()))
  ])
pipeline.fit(X, y)

final = pd.DataFrame(pipeline.predict(X), index=X, columns=mlb.classes_)

def predict(text):
  return pd.DataFrame(pipeline.predict(text), index=text, columns=mlb.classes_)

test = ['is python better than c#', 'should i learn c#',
        'should i learn sklearn or tensorflow',
        'ruby or c# i am a dinosaur',
        'is .net still relevant']
print(predict(test))

Output:

                                      .net  c#  pandas  python  ruby  sklearn
is python better than c#                 0   1       0       1     0        0
should i learn c#                        0   1       0       0     0        0
should i learn sklearn or tensorflow     0   0       0       0     0        1
ruby or c# i am a dinosaur               0   1       0       0     1        0
is .net still relevant                   1   0       0       0     0        0
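If you want tag lists back instead of a 0/1 table, MultiLabelBinarizer can reverse the encoding with inverse_transform. A minimal sketch with a few made-up tag lists (the data here is only for illustration):

```python
from sklearn.preprocessing import MultiLabelBinarizer

tags = [['python', 'pandas'], ['c#', '.net'], ['ruby']]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(tags)  # one row per question, one column per tag

# Columns follow the sorted order of the tags seen during fit.
print(list(mlb.classes_))
print(y)

# Map each indicator row back to a tuple of tag names.
recovered = mlb.inverse_transform(y)
print(recovered)
```

The same call should work on the array returned by a fitted pipeline's predict, as long as it comes from the same fitted binarizer.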
