How do I add a confusion matrix and precision metrics to this experiment?

Time: 2016-12-09 00:26:04

Tags: python scikit-learn jupyter random-forest

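I am running machine learning experiments in Python, and I would like to add a precision metric and a confusion matrix to my experiment. My complete code is as follows: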

print('Random Forest Testing')

import csv
import pandas as pd
from sklearn import preprocessing
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix
# sklearn.cross_validation was removed in scikit-learn 0.20;
# train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split

Open the CSV:

f = open('Telcel_facebook_comments_train.csv')
csv_f = csv.reader(f)

Create the TF-IDF vectorizer:

vectorizer = TfidfVectorizer(analyzer='char',ngram_range=(1, 3))

Lists to hold the comments and tags:

list_comments=[]
list_tags=[]
for row in csv_f:
    list_comments.append(row[0])
    list_tags.append(row[1])        
X = vectorizer.fit_transform(list_comments)
print(X)
vectorizadorEtiquetas= preprocessing.LabelEncoder()
Y=vectorizadorEtiquetas.fit_transform(list_tags)
print(Y)

Get the feature names:

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
tfidf_words = vectorizer.get_feature_names_out()
clf = svm.SVR()
#Second Machine learning algorithm 
clf2 = RandomForestClassifier(n_estimators=10)
clf2 = clf2.fit(X, Y)
#building X train and Y train matrix
X_train, X_test, y_train, y_test = train_test_split(
X, Y, test_size=0.33, random_state=47)
print('Starting training')
#clf.fit(X_train, y_train)
clf2.fit(X_train, y_train)
print('Training Completed')
print(clf2.score(X_test, y_test))

Import the confusion matrix and recall metrics:

from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_recall_fscore_support

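At this point I need to add the precision and the confusion matrix. The following code is wrong because I do not know how to obtain the array called "y_true". I only have three classes: 1, 2, 3.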

print(precision_recall_fscore_support(y_true, y_pred, average='macro'))
print(confusion_matrix(y_true, y_pred))

To make this clearer, here is part of the output:

Random Forest Testing
  (0, 2128) 0.225797583675
  (0, 6205) 0.243191128615
  (0, 6366) 0.21798642306
  (0, 3292) 0.204253719304
  (0, 4763) 0.161726027808
  (0, 1950) 0.264734992986
  (0, 6457) 0.264734992986
  (0, 5153) 0.264734992986
  (0, 3216) 0.105568550619
  (0, 4760) 0.128342578419


[3 1 1 ..., 2 2 2]
Starting training
Training Completed
0.881481481481

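I would appreciate help showing the confusion matrix and the recall metric so that I can understand my model better. Thank you for your support.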

This is my second attempt at getting the results, replacing the lines I tried above:

y_pred = clf2.predict(X_test)
print('Training Completed')


'''
Returns the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy which is a harsh metric since you
require for each sample that each label set be correctly predicted.
'''

print(clf2.score(X_test, y_test))

#importing Confusion Matrix and recall
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_recall_fscore_support

#Here is when I need to add the precision and confusion matrix
print(precision_recall_fscore_support(y_test, y_pred, average='macro'))

print(confusion_matrix(y_test, y_pred))

This is the output:

(0.68431620945676808, 0.61034292763991205, 0.63832235955391514, None)
[[159  83   7   0]
 [  3 811   6   0]
 [  5  22 118   0]
 [  0   1   0   0]]




C:\Program Files\Anaconda3\lib\site-packages\sklearn\metrics\classification.py:1074: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)

The problem now is that I get a 4x4 confusion matrix although I only have three labels, so I would appreciate help with this.

1 answer:

Answer 0 (score: 1)

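Let's break this down to better understand the process:

  1. In your original dataset you have the input samples X and the target classes Y (as I understand it, there are three possible values here: 1, 2 and 3).
  2. When train_test_split is called, the input samples and target classes are split, producing X_train, X_test, Y_train and Y_test (named y_train and y_test in your code).
  3. You must now train your model using X_train and Y_train (this is the part of your code where there is a misunderstanding): clf2 = clf2.fit(X_train, Y_train)
  4. Now that the model has been properly trained on the training data, you can actually test it on the test subsample.
  5. Doing so produces the Y_pred you are looking for: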

    Y_pred = clf2.predict(X_test)
    

    Y_pred is a 1-D array containing the model's prediction for each element of X_test. The true values of these classes are in Y_test.
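As a minimal sketch of steps 1 to 5, assuming a small made-up list of comments and tags in place of the CSV (the `comments` and `tags` values and `random_state=0` on the forest are illustrative assumptions, not part of the question), the flow could look like this:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data standing in for the CSV of comments and tags.
comments = ["great service", "very bad signal", "ok I guess",
            "love this plan", "terrible support", "it is fine"] * 10
tags = ["3", "1", "2", "3", "1", "2"] * 10

# Step 1: input samples X and encoded target classes Y.
X = TfidfVectorizer(analyzer="char", ngram_range=(1, 3)).fit_transform(comments)
Y = LabelEncoder().fit_transform(tags)

# Step 2: split into training and test subsets.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.33, random_state=47)

# Step 3: fit on the training subset only.
clf2 = RandomForestClassifier(n_estimators=10, random_state=0)
clf2.fit(X_train, Y_train)

# Steps 4 and 5: predict on the held-out test subset.
Y_pred = clf2.predict(X_test)
print(Y_pred.shape, Y_test.shape)
```

Y_pred and Y_test have the same length, so they can be compared element by element.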

    You now have Y_pred and Y_test (which plays the role of y_true), so you can evaluate your classifier.
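A hedged sketch of that evaluation follows. The label arrays are hypothetical stand-ins for Y_test and Y_pred, chosen so that a fourth label value appears once; this is one way a 4x4 matrix can arise when only three classes are expected. In the question, the printed Y starts with 3, so the first CSV row seems to be encoded as the fourth class. A header row read as data would be one possible cause, though that is only a guess. The `zero_division` argument assumes scikit-learn 0.22 or later.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Hypothetical labels; in the question these would be y_test and clf2.predict(X_test).
Y_test = np.array([0, 0, 1, 1, 1, 2, 2, 3])
Y_pred = np.array([0, 1, 1, 1, 1, 2, 0, 1])

# The matrix gets one row and column per distinct label seen in either array,
# so listing the distinct labels shows where a fourth class comes from.
print(np.unique(Y_test))

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(Y_test, Y_pred)
print(cm)

# Macro-averaged precision, recall and F-score; zero_division=0 sets the score
# to 0 (without a warning) for labels that are never predicted.
precision, recall, fscore, _ = precision_recall_fscore_support(
    Y_test, Y_pred, average="macro", zero_division=0)
print(precision, recall, fscore)

# Restricting the matrix to the expected labels ignores samples whose true or
# predicted label is outside that list.
cm3 = confusion_matrix(Y_test, Y_pred, labels=[0, 1, 2])
print(cm3)
```

Restricting `labels` only hides the unexpected class from the report. If the extra label comes from bad input data, cleaning the data before encoding is probably the better fix.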

    I hope this helps!