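I am using this code to run predictions that classify text:
predicted = clf.predict(X_new_tfidf)
Each prediction says that a text snippet belongs to either Subject A or Subject B. However, I want to analyze the shaky predictions further, that is, the cases where the model is really unsure whether the text is A or B but has to pick one anyway. Is there a way to extract the relative confidence of a prediction?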
Code:
X_train holds ["Sentence I know belongs to Subject A", "Another sentence that describes Subject A", "A sentence about Subject B", "Another sentence about Subject B"...], and so on.
Y_train holds the corresponding labels: ["Subject A", "Subject A", "Subject B", "Subject B", ...], and so on.
predict_these_X is the list of sentences I want to classify: ["Some random sentence", "Another sentence", "Another sentence again", ...], and so on.
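from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import BernoulliNB

# The wrapper name "classify" is illustrative; the original snippet showed only the body.
def classify(X_train, Y_train, predict_these_X):
    # Turn the training sentences into tf-idf features
    count_vect = CountVectorizer()
    tfidf_transformer = TfidfTransformer()
    X_train_counts = count_vect.fit_transform(X_train)
    X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
    # Apply the already-fitted transformers to the new sentences
    X_new_counts = count_vect.transform(predict_these_X)
    X_new_tfidf = tfidf_transformer.transform(X_new_counts)
    # Fit the classifier and predict a label for each new sentence
    estimator = BernoulliNB()
    estimator.fit(X_train_tfidf, Y_train)
    predictions = estimator.predict(X_new_tfidf)
    # One row per sentence, one column per class in estimator.classes_
    print(estimator.predict_proba(X_new_tfidf))
    return predictions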
Result:
[[ 9.97388646e-07 9.99999003e-01]
[ 9.99996892e-01 3.10826824e-06]
[ 9.40063326e-01 5.99366742e-02]
[ 9.99999964e-01 3.59816546e-08]
...
[ 1.95070084e-10 1.00000000e+00]
[ 3.21721965e-15 1.00000000e+00]
[ 1.00000000e+00 3.89012777e-10]]
Answer 0 (score: 0)
from sklearn.datasets import make_classification
from sklearn.naive_bayes import BernoulliNB
# generate some artificial data
X, y = make_classification(n_samples=1000, n_features=50, weights=[0.1, 0.9])
# your estimator
estimator = BernoulliNB()
estimator.fit(X, y)
# generate predictions
estimator.predict(X)
Out[164]: array([1, 1, 1, ..., 0, 1, 1])
# to get confidence on the prediction
estimator.predict_proba(X)
Out[163]:
array([[ 0.0043, 0.9957],
[ 0.0046, 0.9954],
[ 0.0071, 0.9929],
...,
[ 0.8392, 0.1608],
[ 0.0018, 0.9982],
[ 0.0339, 0.9661]])
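As you can see, for each of the first three observations the model assigns more than 99% probability to the positive class (label 1, the second column). The columns of predict_proba follow the order of estimator.classes_, and each row sums to 1.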
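To single out the shaky predictions the question asks about, one option is to take the largest probability in each row as the confidence and flag rows where it falls below a cutoff. The following is a minimal sketch, not part of the original answer: the helper name flag_uncertain and the 0.95 threshold are hypothetical, and the column order ("Subject A" first) is an assumption based on classes_ being sorted. The sample rows are copied from the output in the question.

```python
def flag_uncertain(proba_rows, class_labels, threshold=0.95):
    """Return (label, confidence, is_uncertain) for each row of probabilities.

    proba_rows: rows as returned by predict_proba (or plain lists of floats).
    class_labels: labels in column order, e.g. estimator.classes_.
    threshold: hypothetical cutoff; rows whose top probability is below it
               are flagged as uncertain.
    """
    results = []
    for row in proba_rows:
        row = [float(p) for p in row]
        confidence = max(row)          # probability of the winning class
        best = row.index(confidence)   # column index of the winning class
        results.append((class_labels[best], confidence, confidence < threshold))
    return results


# First four rows of the predict_proba output shown in the question.
proba = [
    [9.97388646e-07, 9.99999003e-01],
    [9.99996892e-01, 3.10826824e-06],
    [9.40063326e-01, 5.99366742e-02],
    [9.99999964e-01, 3.59816546e-08],
]
# Assumed column order; with a fitted model, use estimator.classes_ instead.
labels = ["Subject A", "Subject B"]

report = flag_uncertain(proba, labels, threshold=0.95)
for label, confidence, uncertain in report:
    print(label, round(confidence, 4), "UNCERTAIN" if uncertain else "ok")
```

With the estimator from the question, the same helper could be called as flag_uncertain(estimator.predict_proba(X_new_tfidf), estimator.classes_). Naive Bayes probabilities tend to sit very close to 0 or 1, so the cutoff is a judgment call and may need to be set quite high to catch anything.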