Getting "Classification metrics can't handle a mix of multiclass and multiclass-multioutput targets" from confusion_matrix

Time: 2018-03-16 20:58:06

Tags: python scikit-learn

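After cross-validating on my training set, I started running into problems with the confusion matrix. X_train has shape (835, 5) and y_train has shape (835,). I can't use confusion_matrix because it complains that my targets are mixed; the cross-validation code before it works fine. My code is below. How should I set up my training data so that I can use confusion_matrix?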

The cross_validate / cross_val_score block

from sklearn import linear_model
from sklearn.model_selection import cross_validate
from sklearn.model_selection import cross_val_score
lasso = linear_model.Lasso()
cross_validate_results = cross_validate(lasso, X_train, y_train, return_train_score=True)
sorted(cross_validate_results.keys())
cross_validate_results['test_score']
print(cross_val_score(lasso, X_train, y_train))

The confusion_matrix block

from sklearn.metrics import confusion_matrix

confusion_matrix(y_train, X_train)

Error

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-83-78f76b6bc798> in <module>()
      1 from sklearn.metrics import confusion_matrix
      2 
----> 3 confusion_matrix(y_test, X_test)

~\Anaconda3\lib\site-packages\sklearn\metrics\classification.py in confusion_matrix(y_true, y_pred, labels, sample_weight)
    248 
    249     """
--> 250     y_type, y_true, y_pred = _check_targets(y_true, y_pred)
    251     if y_type not in ("binary", "multiclass"):
    252         raise ValueError("%s is not supported" % y_type)

~\Anaconda3\lib\site-packages\sklearn\metrics\classification.py in _check_targets(y_true, y_pred)
     79     if len(y_type) > 1:
     80         raise ValueError("Classification metrics can't handle a mix of {0} "
---> 81                          "and {1} targets".format(type_true, type_pred))
     82 
     83     # We can't have more than one value on y_type => The set is no more needed

ValueError: Classification metrics can't handle a mix of multiclass and multiclass-multioutput targets

Printing the shapes of the arrays

print(X_train.shape)
print(y_train.shape)
(835, 5)
(835,)

Update: I now get the error ValueError: Found input variables with inconsistent numbers of samples: [356, 209] when I run confusion_matrix(y_train, y_pred):

from sklearn.metrics import confusion_matrix

confusion_matrix(y_train, y_pred)

Full error

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-46-3caf00cb052f> in <module>()
      1 from sklearn.metrics import confusion_matrix
      2 
----> 3 confusion_matrix(y_train, y_pred)

~\Anaconda3\lib\site-packages\sklearn\metrics\classification.py in confusion_matrix(y_true, y_pred, labels, sample_weight)
    248 
    249     """
--> 250     y_type, y_true, y_pred = _check_targets(y_true, y_pred)
    251     if y_type not in ("binary", "multiclass"):
    252         raise ValueError("%s is not supported" % y_type)

~\Anaconda3\lib\site-packages\sklearn\metrics\classification.py in _check_targets(y_true, y_pred)
     69     y_pred : array or indicator matrix
     70     """
---> 71     check_consistent_length(y_true, y_pred)
     72     type_true = type_of_target(y_true)
     73     type_pred = type_of_target(y_pred)

~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_consistent_length(*arrays)
    202     if len(uniques) > 1:
    203         raise ValueError("Found input variables with inconsistent numbers of"
--> 204                          " samples: %r" % [int(l) for l in lengths])
    205 
    206 

ValueError: Found input variables with inconsistent numbers of samples: [356, 209]

1 Answer:

Answer 0: (score: 0)

You need to pass y to the confusion matrix, not X (http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html). Ideally, you would use sklearn's train_test_split (http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) to hold out part of your data as a test set, and use your model to predict y for that test set. Then you would use

confusion_matrix(y_test, y_pred)

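to compute the confusion matrix. If you have no test set, you would still call your classifier's predict method on X_train to obtain y_pred. In that case you pass y_train as the true labels and y_pred as the predicted labels to the confusion matrix, e.g.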

confusion_matrix(y_train, y_pred)
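
To make the difference concrete, here is a minimal sketch on synthetic data. The arrays below are random stand-ins with the shapes from the question, and the integer-valued features and the choice of KNeighborsClassifier are assumptions for illustration, not the asker's actual setup. It shows why passing X triggers the "mix of multiclass and multiclass-multioutput" error and that predictions made on X_train have the same length as y_train.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.utils.multiclass import type_of_target

# Synthetic stand-in data (assumption): 835 samples, 5 integer features, 3 classes
rng = np.random.RandomState(0)
X_train = rng.randint(0, 4, size=(835, 5))
y_train = rng.randint(0, 3, size=835)

# scikit-learn infers a target type for each argument of confusion_matrix.
# The 1-D label vector is "multiclass"; the 2-D integer feature matrix is
# "multiclass-multioutput". Mixing the two raises the first ValueError.
print(type_of_target(y_train))
print(type_of_target(X_train))

# Predict on the same rows that y_train describes, so both arrays have 835 entries
clf = KNeighborsClassifier().fit(X_train, y_train)
y_pred = clf.predict(X_train)

cm = confusion_matrix(y_train, y_pred)
print(cm)
```

The second error in the question (inconsistent numbers of samples) is the same kind of mismatch: y_pred must come from exactly the rows whose true labels are passed as the first argument.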

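Looking at your code again, your estimator is a regression model (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html#sklearn.linear_model.Lasso), i.e. it predicts numeric values, yet you are trying to use a confusion matrix, which is meant for evaluating classification models, i.e. how well labels are predicted. You should therefore consider metrics other than confusion_matrix.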
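
As a rough illustration of what "other metrics" could look like for a Lasso model, the sketch below scores predictions with mean_squared_error and r2_score. The data is synthetic and the alpha value is an arbitrary assumption; neither comes from the question.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data (assumption): 200 samples, 5 features, continuous target
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.5, -2.0, 0.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=23)

# Fit the regressor and predict continuous values for the held-out rows
lasso = Lasso(alpha=0.1).fit(X_train, y_train)
y_pred = lasso.predict(X_test)

# Regression metrics compare numeric predictions with numeric targets
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("MSE:", mse)
print("R^2:", r2)
```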

Since you have now decided to use knn, try the following first, before dealing with cross-validation.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

# Assuming your target column is y, otherwise use the appropriate column name
X = df.drop(['y'], axis=1).values.astype('float')
y = df['y'].values.astype('float') # assuming you have label encoded your target variable

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=23, stratify=y)

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
print(cm)
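
Once the single split works, one possible way to bring cross-validation back in is cross_val_predict, which returns one out-of-fold prediction per sample so that the result lines up with y. This is only a sketch: it uses the iris dataset bundled with scikit-learn as a placeholder for your data, and cv=5 is an arbitrary choice.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

# Placeholder data (assumption): iris stands in for your own X and y
X, y = load_iris(return_X_y=True)

knn = KNeighborsClassifier()

# Each sample is predicted by a model that did not see it during fitting,
# so y_pred_cv has exactly one entry per entry in y
y_pred_cv = cross_val_predict(knn, X, y, cv=5)

cm_cv = confusion_matrix(y, y_pred_cv)
print(cm_cv)
```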