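I am trying to use scikit-learn 0.12.1 to:
This causes two problems:
My question is: what is the best way to force the classifier to recognize the full set of possible classes, even when some of them do not appear in the training data? Obviously it cannot learn anything about labels it has never seen data for, but zeros are perfectly usable in my case.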
Answer 0 (score: 8)
Here is a workaround. Make sure you have a list of all classes called all_classes. Then, if clf is your LogisticRegression classifier,
from itertools import repeat

# Determine the classes that were not present in the training set;
# the ones that were are listed in clf.classes_.
classes_not_trained = set(all_classes) - set(clf.classes_)

# The order of classes in predict_proba's output matches that in clf.classes_.
prob = clf.predict_proba(test_samples)
for row in prob:
    # Pair each trained class with its probability in this row (not the whole
    # matrix), then append the untrained classes with probability 0.
    prob_per_class = (list(zip(clf.classes_, row))
                      + list(zip(classes_not_trained, repeat(0.))))
gives you a list of (cls, prob) pairs for each row of prob.
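For a concrete feel of what those pairs look like, here is a minimal sketch that swaps the fitted classifier for hard-coded stand-ins. The helper name pairs_with_unseen and the values of trained_classes and prob are assumptions made up for illustration; in real use they would come from clf.classes_ and clf.predict_proba(test_samples).

```python
from itertools import repeat


def pairs_with_unseen(trained_classes, prob_row, all_classes):
    """Return (cls, prob) pairs for one row, adding 0.0 for unseen classes."""
    # sorted() only makes the order of the padded classes deterministic.
    unseen = sorted(set(all_classes) - set(trained_classes))
    return list(zip(trained_classes, prob_row)) + list(zip(unseen, repeat(0.0)))


# Hypothetical stand-ins for clf.classes_ and clf.predict_proba(test_samples).
all_classes = ["a", "b", "c", "d"]
trained_classes = ["a", "c"]
prob = [[0.9, 0.1], [0.25, 0.75]]

pairs = [pairs_with_unseen(trained_classes, row, all_classes) for row in prob]
print(pairs[0])
```

Note that the pairs come out with the trained classes first and the padded ones last, so they are not in overall class order unless you sort them afterwards.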
Answer 1 (score: 3)
If what you want is an array like the one returned by predict_proba, but with columns corresponding to the sorted all_classes, how about:
import numpy

# Sort the full class list so that searchsorted can locate each class
all_classes = numpy.array(sorted(all_classes))
# Get the probabilities for learnt classes
prob = clf.predict_proba(test_samples)
# Create the result matrix, where all values are initially zero
new_prob = numpy.zeros((prob.shape[0], all_classes.size))
# Set the columns corresponding to clf.classes_
new_prob[:, all_classes.searchsorted(clf.classes_)] = prob
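To see how the searchsorted indexing places the columns, here is a small sketch with hard-coded stand-ins instead of a fitted classifier. The arrays trained_classes and prob are made-up values playing the role of clf.classes_ and clf.predict_proba(test_samples), and the name cols is introduced only for illustration.

```python
import numpy as np

# Hypothetical stand-ins: the full label set, the labels seen in training
# (what clf.classes_ would hold), and a predict_proba-style output.
all_classes = np.array(sorted([3, 0, 2, 1]))
trained_classes = np.array([0, 2])
prob = np.array([[0.75, 0.25],
                 [0.5, 0.5]])

# Column positions of the trained classes within the sorted full label set.
cols = all_classes.searchsorted(trained_classes)

# Start from zeros, then copy the learnt probabilities into their columns.
new_prob = np.zeros((prob.shape[0], all_classes.size))
new_prob[:, cols] = prob
print(cols)
print(new_prob)
```

Because searchsorted returns insertion points rather than exact matches, this only works when all_classes is sorted and contains every label in clf.classes_. A label missing from all_classes is not flagged by searchsorted itself, so it may be worth checking membership first.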
Answer 2 (score: 2)
Building on larsman's excellent answer, I ended up with this:
from itertools import repeat
import numpy as np

# Determine the classes that were not present in the training set;
# the ones that were are listed in clf.classes_.
classes_not_trained = set(all_classes) - set(clf.classes_)

# The order of classes in predict_proba's output matches that in clf.classes_.
prob = clf.predict_proba(test_samples)
new_prob = []
for row in prob:
    prob_per_class = (list(zip(clf.classes_, row))
                      + list(zip(classes_not_trained, repeat(0.))))
    # Put the probabilities in class order
    prob_per_class = sorted(prob_per_class)
    # Append a list (not a generator) so np.asarray builds a 2-D float array
    new_prob.append([p for _, p in prob_per_class])
new_prob = np.asarray(new_prob)
new_prob is an [n_samples, n_classes] array, just like the output of predict_proba, except that it now contains a probability of 0 for each previously unseen class.
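One way to convince yourself the padded array is sane is a quick check like the sketch below. The helper check_padded_probs and the sample arrays are hypothetical, not part of scikit-learn, and the check assumes the columns of new_prob follow the sorted class order, as in the code above.

```python
import numpy as np


def check_padded_probs(new_prob, all_classes, trained_classes):
    """Return True if rows sum to 1 and unseen-class columns are all zero."""
    trained = set(trained_classes)
    # Columns are assumed to follow the sorted order of all_classes.
    unseen_cols = [i for i, c in enumerate(sorted(all_classes))
                   if c not in trained]
    rows_ok = bool(np.allclose(new_prob.sum(axis=1), 1.0))
    unseen_ok = bool(np.all(new_prob[:, unseen_cols] == 0.0))
    return rows_ok and unseen_ok


# Hypothetical padded output for classes 0..3 where only 0 and 2 were trained.
good = np.array([[0.75, 0.0, 0.25, 0.0],
                 [0.5, 0.0, 0.5, 0.0]])
# A deliberately wrong array: class 1 was never trained but has mass.
bad = np.array([[0.5, 0.2, 0.3, 0.0]])

ok_good = check_padded_probs(good, [0, 1, 2, 3], [0, 2])
ok_bad = check_padded_probs(bad, [0, 1, 2, 3], [0, 2])
print(ok_good, ok_bad)
```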