Why is multilabel classification (binary relevance) not working?

Date: 2016-10-26 09:56:10

Tags: machine-learning classification multilabel-classification

I am new to multilabel classification using binary relevance and have a question about interpreting the result:

The result is:

[[0. 0.]
 [2. 2.]]

Does this mean the first case is classified as [0, 0] and the second as [2, 2]? That does not look right. Or am I missing something?

After applying the gentlemen's answers below, I now get the following error, caused by the 0 in the y_train label [2, 0, 3, 4]:

Traceback (most recent call last):
File "driver.py", line 22, in <module>
clf_dict[i] = clf.fit(x_train, y_tmp)
File "C:\Users\BaderEX\Anaconda22\lib\site-packages\sklearn\linear_model\logistic.py", line 1154, in fit
self.max_iter, self.tol, self.random_state)
File "C:\Users\BaderEX\Anaconda22\lib\site-packages\sklearn\svm\base.py", line 885, in _fit_liblinear
" class: %r" % classes_[0])
ValueError: This solver needs samples of at least 2 classes in the data, but the data contains only one class: 1

Updated code:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import *

numer_classes = 5

x_train = np.array([[1,2,3,4],[0,1,2,1],[1,2,0,3]])
y_train = [[0],[1,0,3],[2,0,3,4]]

x_test = np.array([[1,2,3,4],[0,1,2,1],[1,2,0,3]])
y_test = [[0],[1,0,3],[2,0,3,4]]

clf_dict = {}
for i in range(numer_classes):
    y_tmp = []
    for j in range(len(y_train)):
        if i in y_train[j]:
            y_tmp.append(1)
        else:
            y_tmp.append(0)
    clf = LogisticRegression()
    clf_dict[i] = clf.fit(x_train, y_tmp)

prediction_matrix = np.zeros((len(x_test),numer_classes))
for i in range(numer_classes):
    prediction = clf_dict[i].predict(x_test)
    prediction_matrix[:,i] = prediction    

print('Predicted')
print(prediction_matrix)
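For reference, the error comes from the fact that label 0 appears in every training example, so the per-label target vector built for class 0 contains only ones. A standalone sketch using the same y_train makes this visible:

```python
y_train = [[0], [1, 0, 3], [2, 0, 3, 4]]

# Build the per-label binary targets exactly as the training loop does.
targets = {}
for label in range(5):
    targets[label] = [1 if label in labels else 0 for labels in y_train]
    print(label, targets[label])

# label 0 gives [1, 1, 1]: every sample is positive, so that classifier's
# training data contains a single class and LogisticRegression.fit raises
# "This solver needs samples of at least 2 classes in the data".
```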

Thanks

2 answers:

Answer 0: (score: 3)

For binary relevance you should make an indicator class for every label: 0 or 1. scikit-multilearn provides a scikit-compatible implementation of the classifier.

Set:

def to_indicator_matrix(y_list):
    # width = largest label value + 1, so every label gets its own column
    n_labels = max(max(y) for y in y_list) + 1
    y_train_matrix = np.zeros(shape=(len(y_list), n_labels), dtype='i8')
    for i, y in enumerate(y_list):
        y_train_matrix[i][y] = 1
    return y_train_matrix

Given your y_train and y_test, run:

y_train = to_indicator_matrix(y_train)
y_test =  to_indicator_matrix(y_test)

Your y_train is now:

array([[1, 1, 0],
       [0, 1, 1],
       [1, 0, 1]])
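Note that the 3x3 matrix above comes from a smaller three-label example; with the updated y_train from the question, the same helper (sized by the largest label value, shown here as a self-contained sketch) would produce a 3x5 indicator matrix:

```python
import numpy as np

def to_indicator_matrix(y_list):
    # width = largest label value + 1, so every label gets its own column
    n_labels = max(max(y) for y in y_list) + 1
    m = np.zeros((len(y_list), n_labels), dtype='i8')
    for i, y in enumerate(y_list):
        m[i][y] = 1
    return m

indicator = to_indicator_matrix([[0], [1, 0, 3], [2, 0, 3, 4]])
print(indicator)
```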

This should fix your problem. It is more comfortable to use scikit-multilearn's BinaryRelevance than your own code. Try it!

Run

pip install scikit-multilearn

then try:

from skmultilearn.problem_transform import BinaryRelevance
from sklearn.linear_model import LogisticRegression
import sklearn.metrics

# assume the data is loaded and available in
# X_train/X_test, y_train/y_test

# initialize the Binary Relevance multi-label classifier
# with a logistic regression base classifier
classifier = BinaryRelevance(LogisticRegression(C=40, class_weight='balanced'), require_dense=[False, True])

# train
classifier.fit(X_train, y_train)

# predict
predictions = classifier.predict(X_test)

# measure
print(sklearn.metrics.hamming_loss(y_test, predictions))
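For intuition, hamming loss is the fraction of individual label slots that are predicted wrongly; a tiny hand computation (with made-up indicator matrices) matches sklearn's function:

```python
import numpy as np
from sklearn.metrics import hamming_loss

# Toy indicator matrices: 2 samples x 3 labels (illustrative values only).
y_true = np.array([[1, 0, 1],
                   [0, 1, 1]])
y_pred = np.array([[1, 1, 1],
                   [0, 1, 0]])

# Hamming loss = fraction of label positions that disagree.
manual = (y_true != y_pred).mean()
print(manual)                        # 2 wrong slots out of 6 -> 0.333...
print(hamming_loss(y_true, y_pred))  # same value
```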

Answer 1: (score: 2)

I think you made a mistake in your implementation. For binary relevance, we need a separate classifier for each label. There are three labels, so there should be 3 classifiers. Each classifier tells whether an instance belongs to one class. For example, the classifier corresponding to class 1 (clf[1]) only tells whether an instance belongs to class 1.

So, if you want to implement binary relevance by hand, the labels should be binarized in the loop that creates the classifiers:

for i in range(numer_classes):
    y_tmp = []
    for j in range(len(y_train)):
        if i in y_train[j]:
            y_tmp.append(1)
        else:
            y_tmp.append(0)
    clf = LogisticRegression()
    clf_dict[i] = clf.fit(x_train, y_tmp)

However, things are more convenient if you use sklearn:

from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

binarizer = MultiLabelBinarizer()
y_train_binarized = binarizer.fit_transform(y_train)
y_test_binarized = binarizer.transform(y_test)  # transform only: reuse the label set fitted on y_train
cls = OneVsRestClassifier(estimator=LogisticRegression())
cls.fit(x_train,y_train_binarized)
y_predict = cls.predict(x_test)

The result looks like:

[[1 0 1]
 [0 1 1]]

which means the first case is predicted as [0, 2] and the second as [1, 2].
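If you want indicator rows like these back as plain label lists, MultiLabelBinarizer.inverse_transform reverses the mapping. A small sketch (the label universe {0, 1, 2} and the two prediction rows are taken from the example above):

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

binarizer = MultiLabelBinarizer()
binarizer.fit([[0, 1, 2]])  # fix the label universe to {0, 1, 2}

predictions = np.array([[1, 0, 1],
                        [0, 1, 1]])

# Each indicator row maps back to the tuple of its active labels.
recovered = binarizer.inverse_transform(predictions)
print(recovered)  # [(0, 2), (1, 2)]
```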