Question

I am using this code to run predictions that classify text:

predicted = clf.predict(X_new_tfidf)

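Each prediction says that a text fragment belongs to either Subject A or Subject B. However, I want to analyze the unstable predictions further, that is, the cases where the model is really unsure whether the text is A or B but has to pick one anyway. Is there a way to extract the relative confidence of a prediction?

Code: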

X_train contains ["Sentence I know belongs to Subject A", "Another sentence that describes Subject A", "A sentence about Subject B", "Another sentence about Subject B"...] and so on.

Y_train contains the corresponding labels: ["Subject A", "Subject A", "Subject B", "Subject B", ...] and so on.

predict_these_X is the list of sentences I want to classify: ["Some random sentence", "Another sentence", "Another sentence again", ...] and so on.

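    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.naive_bayes import BernoulliNB

    # The function wrapper and its name are assumed; the original snippet ended in a bare `return`
    def classify(X_train, Y_train, predict_these_X):
        count_vect = CountVectorizer()
        tfidf_transformer = TfidfTransformer()

        # Learn vocabulary and IDF weights from the training sentences
        X_train_counts = count_vect.fit_transform(X_train)
        X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

        # Reuse the fitted transformers on the new sentences
        X_new_counts = count_vect.transform(predict_these_X)
        X_new_tfidf = tfidf_transformer.transform(X_new_counts)

        estimator = BernoulliNB()
        estimator.fit(X_train_tfidf, Y_train)
        predictions = estimator.predict(X_new_tfidf)

        # One row per sentence, one column per class
        print(estimator.predict_proba(X_new_tfidf))
        return predictions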

Result:

[[  9.97388646e-07   9.99999003e-01]
 [  9.99996892e-01   3.10826824e-06]
 [  9.40063326e-01   5.99366742e-02]
 [  9.99999964e-01   3.59816546e-08]
 ...
 [  1.95070084e-10   1.00000000e+00]
 [  3.21721965e-15   1.00000000e+00]
 [  1.00000000e+00   3.89012777e-10]]

Answer 1

from sklearn.datasets import make_classification
from sklearn.naive_bayes import BernoulliNB

# generate some artificial data
X, y = make_classification(n_samples=1000, n_features=50, weights=[0.1, 0.9])


# your estimator
estimator = BernoulliNB()
estimator.fit(X, y)

# generate predictions
estimator.predict(X)
Out[164]: array([1, 1, 1, ..., 0, 1, 1])

# to get confidence on the prediction
estimator.predict_proba(X)

Out[163]: 
array([[ 0.0043,  0.9957],
       [ 0.0046,  0.9954],
       [ 0.0071,  0.9929],
       ..., 
       [ 0.8392,  0.1608],
       [ 0.0018,  0.9982],
       [ 0.0339,  0.9661]])

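As you can see, for each of the first three observations the model assigns a probability above 99% to the positive class (class 1, the second column).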
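
To single out the unstable predictions the question asks about, one option is to take the largest probability in each row and flag the rows where it falls below some cutoff. The sketch below reuses the artificial data from above; the 0.75 cutoff, the `random_state` value, and the variable names are my own assumptions for illustration, not anything prescribed by scikit-learn. Note also that Naive Bayes probabilities tend to be pushed toward 0 and 1, so they are probably better read as a relative ranking than as calibrated confidence.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import BernoulliNB

# Artificial data as in the example above; random_state is an added assumption for repeatability
X, y = make_classification(n_samples=1000, n_features=50,
                           weights=[0.1, 0.9], random_state=0)

estimator = BernoulliNB()
estimator.fit(X, y)

# Shape (n_samples, n_classes); columns follow the order of estimator.classes_
proba = estimator.predict_proba(X)

# Probability assigned to the winning class, and the winning class itself
top = proba.max(axis=1)
predicted = estimator.classes_[proba.argmax(axis=1)]

# Hypothetical cutoff: anything below it is treated as an unstable prediction
UNCERTAIN_BELOW = 0.75
uncertain_idx = np.flatnonzero(top < UNCERTAIN_BELOW)

print("classes:", estimator.classes_)
print("uncertain predictions:", len(uncertain_idx), "of", len(top))
for i in uncertain_idx[:5]:
    # Row index, predicted label, and the full probability row
    print(i, predicted[i], proba[i])
```

With two classes the winning probability is always at least 0.5, so the cutoff has to sit somewhere between 0.5 and 1; where exactly depends on how many borderline cases you are willing to review by hand.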
