Question

I am trying to classify text data with Scikit Learn using the method shown here (http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html), except that I am loading my own dataset.

I am getting results, but I want to find out how accurate the classification is.

    from sklearn.datasets import load_files

    text_data = load_files("C:/Users/USERNAME/projects/machine_learning/my_project/train", description=None, categories=None, load_content=True, shuffle=True, encoding='latin-1', decode_error='ignore', random_state=0)

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.svm import LinearSVC
    text_clf = Pipeline([('vect', CountVectorizer()),
                        ('tfidf', TfidfTransformer()),
                        ('clf', LinearSVC(loss='hinge', penalty='l2',
                                                random_state=42)),
    ])

    _ = text_clf.fit(text_data.data, text_data.target)

    docs_new = ["Some test sentence here.",]

    predicted = text_clf.predict(docs_new)
    print np.mean(predicted == text_data.target) 

    for doc, category in zip(docs_new, predicted):
        print('%r => %s' % (doc, text_data.target_names[predicted]))

Here, np.mean of the prediction comes out as 0.566.

If I try:

    twenty_test = load_files("C:/Users/USERNAME/projects/machine_learning/my_project/testing", description=None, categories=None, load_content=True, shuffle=True, encoding='latin-1', decode_error='ignore', random_state=0)
    docs_test = twenty_test.data
    predicted = text_clf.predict(docs_test)
    np.mean(predicted == twenty_test.target)

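it now prints 1.

I don't understand how this works, what exactly np.mean is, and why it shows different results when the classifier was trained on the same data.

The "train" folder has about 15 files and the test folder also has about 15 files, in case that matters. I am new to Scikit Learn and machine learning, so any help is greatly appreciated. Thanks!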

Answer 1

predict() returns an array of predicted class labels for the given unseen texts. See the source here.

docs_new = ['God is love', 'OpenGL on the GPU is fast', 'java', '3D', 'Cinema 4D']
predicted = clf.predict(X_new_tfidf)
print(predicted)
for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))

[3 1 2 1 1]
'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => comp.graphics
'java' => sci.med
'3D' => comp.graphics
'Cinema 4D' => comp.graphics

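As you can see, predicted is an array. The numbers in the array are indices into the label names, which are looked up in the subsequent for loop.

When you call np.mean, the goal is to determine the accuracy of the classifier, and that does not apply in your first example, because the text "Some test sentence here." has no label. The text can still be used to predict which label it belongs to. You can do that in your script by changing: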

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, text_data.target_names[predicted]))

to:

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, text_data.target_names[category]))

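When your second call to np.mean returns 1, it means the classifier predicted the correct label for every unseen document, i.e. 100% accuracy. The comparison is meaningful there because the twenty_test data also carries label information.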
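
As a rough illustration of what that np.mean call computes, the sketch below uses made-up label arrays (y_true and y_pred are hypothetical and not taken from the question's data). The mean of the element-wise comparison is the fraction of correct predictions, which is also what sklearn.metrics.accuracy_score reports.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions for six test documents.
y_true = np.array([0, 1, 1, 2, 0, 2])
y_pred = np.array([0, 1, 2, 2, 0, 1])

# Element-wise comparison gives one boolean per document.
correct = y_pred == y_true

# True counts as 1 and False as 0, so the mean is the share of correct predictions.
manual_accuracy = np.mean(correct)
library_accuracy = accuracy_score(y_true, y_pred)

print(correct)
print(manual_accuracy, library_accuracy)
```

Both values agree only because the two arrays have the same length and describe the same documents in the same order.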

To get more information about the accuracy of the classifier, you can use:

from sklearn import metrics
print(metrics.classification_report(twenty_test.target, predicted,
    target_names=twenty_test.target_names)) 


                        precision    recall  f1-score   support

           alt.atheism       0.95      0.81      0.87       319
         comp.graphics       0.88      0.97      0.92       389
               sci.med       0.94      0.90      0.92       396
soc.religion.christian       0.90      0.95      0.93       398

           avg / total       0.92      0.91      0.91      1502

If you want a confusion matrix, you can use:

metrics.confusion_matrix(twenty_test.target, predicted)

array([[258,  11,  15,  35],
       [  4, 379,   3,   3],
       [  5,  33, 355,   3],
       [  5,  10,   4, 379]])
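
To connect the two outputs above, here is a small sketch that recomputes per-class recall and precision from the confusion matrix shown. It assumes rows are true labels and columns are predicted labels (the scikit-learn convention); the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true labels, columns: predicted labels).
cm = np.array([[258,  11,  15,  35],
               [  4, 379,   3,   3],
               [  5,  33, 355,   3],
               [  5,  10,   4, 379]])

correct_per_class = cm.diagonal()

# Recall: correct predictions divided by the number of true members (row sums).
recall = correct_per_class / cm.sum(axis=1)
# Precision: correct predictions divided by how often the class
# was predicted (column sums).
precision = correct_per_class / cm.sum(axis=0)
# Overall accuracy: all correct predictions divided by all documents.
accuracy = correct_per_class.sum() / cm.sum()

print(np.round(recall, 2))
print(np.round(precision, 2))
print(round(float(accuracy), 2))
```

The rounded values line up with the precision and recall columns of the classification report, and the row sums equal the support column.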

Answer 2

text_data = load_files("C:/Users/USERNAME/projects/machine_learning/my_project/train", ...)

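According to the documentation, this line loads the contents of the files under C:/Users/USERNAME/projects/machine_learning/my_project/train into text_data.data. It also loads the target label of each document (represented by its integer index) into text_data.target. So text_data.data should be a list of strings and text_data.target should be an array of integers. The labels are derived from the folders the files are in. Your description sounds as if there are no subfolders in C:/.../train/ and C:/.../test/, which could cause problems (e.g. all labels being the same).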
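
The folder layout that load_files expects can be sketched as follows. This is only an illustration: it builds a throwaway directory with two made-up category folders ("cats" and "dogs") rather than using the paths from the question.

```python
import os
import tempfile
from sklearn.datasets import load_files

# Build a temporary dataset: one subfolder per category (names are made up).
root = tempfile.mkdtemp()
samples = {
    "cats": ["the cat purrs", "a kitten sleeps"],
    "dogs": ["the dog barks", "a puppy plays"],
}
for category, texts in samples.items():
    os.makedirs(os.path.join(root, category))
    for i, text in enumerate(texts):
        path = os.path.join(root, category, "doc%d.txt" % i)
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)

dataset = load_files(root, encoding="utf-8", shuffle=True, random_state=0)

print(dataset.target_names)  # category names taken from the subfolder names
print(dataset.target)        # one integer label index per file
print(len(dataset.data))     # number of documents loaded
```

Without such subfolders there is nothing for the loader to derive distinct labels from.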

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
text_clf = Pipeline([('vect', CountVectorizer()),
                    ('tfidf', TfidfTransformer()),
                    ('clf', LinearSVC(loss='hinge', penalty='l2',
                                            random_state=42)),
])

_ = text_clf.fit(text_data.data, text_data.target)

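The lines above train the classifier (in .fit()) on the example documents. Very roughly speaking, you tell the classifier (LinearSVC) which words occur how often in which documents (CountVectorizer, TfidfTransformer) and which label each of those documents has (text_data.target). The classifier then tries to learn a rule that basically maps those word frequencies (TF-IDF values) to labels (e.g. dog and cat strongly indicating the label animal).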
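
A toy version of this training step might look like the sketch below. The documents, the two labels and the names train_docs, train_target and target_names are all made up for illustration; only the pipeline steps mirror the ones in the question.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up training documents and labels (0 = animal, 1 = vehicle).
train_docs = [
    "my dog chased a cat",
    "the cat and my dog sleep",
    "my cat ignores every dog",
    "new tires for the car",
    "the car needs new tires",
    "a truck passed the car",
]
train_target = [0, 0, 0, 1, 1, 1]
target_names = ["animal", "vehicle"]

text_clf = Pipeline([
    ("vect", CountVectorizer()),    # documents -> word counts
    ("tfidf", TfidfTransformer()),  # word counts -> TF-IDF values
    ("clf", LinearSVC(loss="hinge", penalty="l2", random_state=42)),
])
text_clf.fit(train_docs, train_target)

# New documents are mapped to label indices, then to label names.
new_docs = ["my cat saw a dog", "new car tires"]
predicted = text_clf.predict(new_docs)
predicted_names = [target_names[i] for i in predicted]
print(predicted_names)
```

Note that each predicted index is used on its own to look up the label name, one document at a time.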

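After the classifier has been trained on the example data, you give it a completely new document and let it predict the most appropriate label for that document based on what it has learned.

docs_new = ["Some test sentence here.",]
predicted = text_clf.predict(docs_new)

predicted should be a list of label indices with only one element (because you have one document), e.g. [5].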

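print np.mean(predicted == text_data.target)

Here you compare the list of predictions (1 element) with the list of labels from your training data (15 elements) and then take the mean of the result. That does not make much sense, because the lists have different sizes and because your new example document has nothing to do with the training labels. NumPy probably falls back to comparing your predicted label (e.g. 5) with every element in text_data.target. That creates a list like [False, False, False, True, False, True, ...], which np.mean interprets as [0, 0, 0, 1, 0, 1, ...], resulting in a mean of 1/15 * (0+0+0+1+0+1+...).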
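
The broadcasting behaviour described here can be reproduced in isolation. The values below are hypothetical: one predicted label and five training labels instead of the question's real data.

```python
import numpy as np

# Hypothetical values: one prediction and the labels of five training files.
predicted = np.array([1])
training_labels = np.array([0, 1, 1, 0, 1])

# NumPy broadcasts the single prediction against every training label.
comparison = predicted == training_labels
print(comparison)

# The mean is just the share of training files whose label equals the
# predicted label; it says nothing about how accurate the classifier is.
share_matching = np.mean(comparison)
print(share_matching)
```

So a value such as 0.566 in this situation would only describe how common the predicted label is among the training labels.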

What you should do instead is something like this:

docs_new = ["Some test sentence here."]
docs_new_labels = [1] # correct label index of the document

predicted = text_clf.predict(docs_new)
print(np.mean(predicted == docs_new_labels))

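At the very least, you should not compare against your training labels. Note that if np.mean returns 1, every document was classified correctly, which seems to be what happened with your test dataset. Make sure that your test and training files are actually different, because 100% accuracy is not very common (though it may be an artifact of your small number of training files). As a side note, pay attention to tokenization: without it, here and here. would be completely different words for your classifier.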
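
To see how text is actually split before it reaches the classifier, one can inspect the analyzer of the vectorizer. The sketch below uses a made-up sentence and compares plain whitespace splitting with what CountVectorizer produces, assuming its default settings are in effect (lowercasing and a token pattern that matches runs of word characters).

```python
from sklearn.feature_extraction.text import CountVectorizer

text = "Click here and here."

# Naive whitespace splitting keeps punctuation attached to the word.
naive_tokens = text.lower().split()

# The analyzer is the callable CountVectorizer uses internally to
# turn a raw string into tokens (default settings assumed here).
analyzer = CountVectorizer().build_analyzer()
vectorizer_tokens = analyzer(text)

print(naive_tokens)
print(vectorizer_tokens)
```

If the vectorizer in a pipeline is configured with a custom tokenizer or analyzer, the same check shows whether punctuation still ends up inside the tokens.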

I am not sure how to interpret the accuracy of this classification with Scikit Learn
