Question

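I am doing some text classification. Suppose I have 10 categories and 100 "samples", where each sample is a sentence of text. I split the samples 80:20 (train, test) and trained an SVM classifier: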

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

text_clf_svm = Pipeline([('vect', CountVectorizer(stop_words='english', ngram_range=(1, 2))),
                         ('tfidf', TfidfTransformer()),
                         ('clf-svm', SGDClassifier(loss='hinge', penalty='l2', random_state=42, learning_rate='adaptive', eta0=0.9))])

# Fit the SVM classifier on the training data
text_clf_svm = text_clf_svm.fit(training_data, training_sub_categories)

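Now, when it comes to prediction, I don't want to predict just a single category. For example, I'd like to see a list of the "top 5" categories for a given unseen sample, along with their associated probabilities: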

top_5_category_predictions = text_clf_svm.predict(a_single_unseen_sample)

Since text_clf_svm.predict returns a value that represents the index of one of the available categories, I'd like to see something like this as output:

[(4,0.70),(1,0.20),(7,0.04),(9,0.06)]

Does anyone know how to achieve this?

Answer 1

This is something I used a while ago to solve a similar problem:

import numpy as np

probs = clf.predict_proba(X_test)
# Sort desc and only extract the top-n
top_n_category_predictions = np.argsort(probs)[:,:-n-1:-1]

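This gives you the top n categories for each sample, as column indices into probs (the columns follow the order of clf.classes_).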
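To make the slicing concrete, here is a minimal sketch on a hand-written probability matrix. The values in probs and the labels in classes are made up for illustration; classes stands in for what clf.classes_ would hold on a fitted classifier.

```python
import numpy as np

# Hypothetical probabilities for 2 samples over 4 classes (illustrative values only).
probs = np.array([
    [0.10, 0.60, 0.05, 0.25],
    [0.40, 0.10, 0.30, 0.20],
])
# Hypothetical labels, standing in for clf.classes_ (same order as the columns of probs).
classes = np.array(["bird", "cat", "dog", "fish"])

n = 2
# argsort sorts ascending, so the reversed slice [:-n-1:-1] walks the last n
# columns backwards, i.e. the n largest probabilities in descending order.
top_n_idx = np.argsort(probs)[:, :-n-1:-1]

# Fancy indexing maps the column indices back to class labels.
top_n_labels = classes[top_n_idx]

print(top_n_idx.tolist())     # [[1, 3], [0, 2]]
print(top_n_labels.tolist())  # [['cat', 'fish'], ['bird', 'dog']]
```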

If you also want to see the probabilities corresponding to these categories, you can do the following:

top_n_probs = np.sort(probs)[:,:-n-1:-1]
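The question asks for (category, probability) pairs, so the two results can be zipped together. The sketch below is one possible way to do that; top_n_with_probs is a hypothetical helper name, not part of scikit-learn, and it uses np.take_along_axis to read the probabilities at the same indices instead of sorting a second time, which keeps labels and probabilities aligned even when values tie.

```python
import numpy as np

def top_n_with_probs(probs, classes, n):
    """Hypothetical helper: per sample, a list of (label, probability)
    pairs ordered by descending probability."""
    probs = np.asarray(probs)
    classes = np.asarray(classes)
    # Column indices of the n largest probabilities, largest first.
    idx = np.argsort(probs, axis=1)[:, :-n-1:-1]
    # Read the probabilities at exactly those indices.
    top_probs = np.take_along_axis(probs, idx, axis=1)
    return [
        list(zip(classes[row_idx].tolist(), row_probs.tolist()))
        for row_idx, row_probs in zip(idx, top_probs)
    ]

# Illustrative values only: one sample, four classes labelled 1, 7, 4, 9.
example = top_n_with_probs([[0.20, 0.04, 0.70, 0.06]], [1, 7, 4, 9], n=3)
print(example)  # [[(4, 0.7), (1, 0.2), (9, 0.06)]]
```

With a fitted classifier, the call would look like top_n_with_probs(clf.predict_proba(X_test), clf.classes_, 5).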

Note: X_test here has shape (n_samples, n_features), so make sure you pass a_single_unseen_sample in the same format.
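One caveat when applying this to the pipeline from the question: SGDClassifier only exposes predict_proba for probabilistic losses such as loss='modified_huber' or loss='log_loss', not for loss='hinge'. The sketch below is an end-to-end illustration under that assumption; it swaps in loss='modified_huber', and the training texts and labels are toy data invented for the example. Because the pipeline starts with a text vectorizer, a single unseen sample is passed as a one-element list of strings.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Toy training data, invented for illustration.
train_texts = [
    "the cat sat on the mat", "dogs chase cats", "my dog barks loudly",
    "the team won the match", "a late goal decided the game", "the player scored twice",
    "the laptop needs a new battery", "install the software update", "the server crashed again",
]
train_labels = ["animals"] * 3 + ["sports"] * 3 + ["tech"] * 3

# Assumption: loss='modified_huber' replaces the question's loss='hinge',
# because predict_proba is not available with hinge loss.
clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", SGDClassifier(loss="modified_huber", random_state=42)),
])
clf.fit(train_texts, train_labels)

# A single unseen sample still has to be wrapped in a list.
probs = clf.predict_proba(["the cat chased the dog"])

n = 2
idx = np.argsort(probs, axis=1)[:, :-n-1:-1]
# Pair each top-n label with its probability for the first (only) sample.
top = [(clf.classes_[i], float(probs[0, i])) for i in idx[0]]
print(top)
```

If the hinge loss has to stay, wrapping the classifier in sklearn.calibration.CalibratedClassifierCV is another way to obtain probability estimates, at the cost of an extra calibration step.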

SkLearn SVM - How to get multiple predictions ordered by probability?
