Question

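I am trying to use scikit-learn 0.12.1 to:

1. train a LogisticRegression classifier
2. evaluate the classifier on held-out validation data
3. feed new data to this classifier and retrieve the 5 most probable labels for each observation

This raises two problems:

1. The label vectorizer does not recognize previously unseen labels when they occur in the validation data. This is easily fixed by fitting the labeler on the full set of possible labels, but that makes problem 2 worse.
2. The output of the LogisticRegression classifier's predict_proba method is an [n_samples, n_classes] array, where n_classes covers only the classes seen in the training data. Running argsort on the predict_proba array therefore no longer gives indices that map directly onto the label vectorizer's vocabulary.

My question is: what is the best way to force the classifier to recognize the full set of possible classes, even when some of them do not occur in the training data? Obviously it cannot learn anything about labels it has never seen data for, but 0's are perfectly usable in my situation.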

Answer 1

Here's a workaround. Make sure you have a list of all classes called all_classes. Then, if clf is your LogisticRegression classifier,

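from itertools import repeat

# determine the classes that were not present in the training set;
# the ones that were are listed in clf.classes_.
classes_not_trained = set(all_classes).difference(clf.classes_)

# the order of classes in predict_proba's output matches that in clf.classes_.
prob = clf.predict_proba(test_samples)
prob_per_class = []
for row in prob:
    # pair each trained class with this row's probability (row, not prob),
    # then pad the untrained classes with probability 0; list() is needed
    # because zip objects cannot be concatenated in Python 3
    prob_per_class.append(list(zip(clf.classes_, row))
                          + list(zip(classes_not_trained, repeat(0.))))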

This gives you a list of (cls, prob) pairs for each sample.
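The same padding idea can be tried without a trained classifier. The following is a minimal sketch, not part of the original answer: the helper name `pad_with_zeros` is hypothetical, `trained_classes` stands in for `clf.classes_`, and `row` stands in for one row of the `predict_proba` output.

```python
def pad_with_zeros(trained_classes, row, all_classes):
    """Return (cls, prob) pairs for every class in all_classes.

    trained_classes stands in for clf.classes_ and row for one row of
    clf.predict_proba(...); both are illustrative assumptions.
    """
    # map each class seen in training to its predicted probability
    seen = dict(zip(trained_classes, row))
    # classes that never appeared in training get probability 0.0
    return [(cls, seen.get(cls, 0.0)) for cls in all_classes]


# hand-written probabilities in place of real classifier output
pairs = pad_with_zeros(["cat", "dog"], [0.25, 0.75], ["bird", "cat", "dog"])
print(pairs)
```

Using a dictionary lookup keeps the pairs in the order of all_classes, so no separate sorting step is needed afterwards.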

Answer 2

If what you want is an array like the one returned by predict_proba, but with columns corresponding to the sorted all_classes, how about:

import numpy

# Sort the full label set so that searchsorted can be used below
all_classes = numpy.array(sorted(all_classes))
# Get the probabilities for learnt classes
prob = clf.predict_proba(test_samples)
# Create the result matrix, where all values are initially zero
new_prob = numpy.zeros((prob.shape[0], all_classes.size))
# Set the columns corresponding to clf.classes_
# (this assumes every class in clf.classes_ also appears in all_classes)
new_prob[:, all_classes.searchsorted(clf.classes_)] = prob
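One caveat with searchsorted is that it returns an insertion point even for values that are not in the array, so a trained class missing from all_classes would be mapped to the wrong column without any error. The sketch below wraps the approach in a function with an explicit check. The name `expand_proba` is hypothetical, and the hard-coded `prob` array stands in for real `clf.predict_proba` output.

```python
import numpy as np


def expand_proba(prob, trained_classes, all_classes):
    """Spread the columns of prob over the sorted, full class set."""
    all_classes = np.array(sorted(all_classes))
    trained_classes = np.asarray(trained_classes)
    # searchsorted does not fail on unknown values, so check explicitly
    if not np.isin(trained_classes, all_classes).all():
        raise ValueError("all_classes must contain every trained class")
    cols = all_classes.searchsorted(trained_classes)
    # columns for classes never seen in training stay at zero
    new_prob = np.zeros((prob.shape[0], all_classes.size))
    new_prob[:, cols] = prob
    return all_classes, new_prob


# stand-in for predict_proba output of a classifier with classes_ == [1, 3]
prob = np.array([[0.2, 0.8], [0.6, 0.4]])
classes, new_prob = expand_proba(prob, [1, 3], [0, 1, 2, 3])
print(classes)
print(new_prob)
```

Returning the sorted class array together with the matrix keeps the column-to-label mapping in one place.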

Answer 3

Building on larsman's excellent answer, I ended up with this:

from itertools import repeat
import numpy as np

# determine the classes that were not present in the training set;
# the ones that were are listed in clf.classes_.
classes_not_trained = set(all_classes).difference(clf.classes_)

# the order of classes in predict_proba's output matches that in clf.classes_.
prob = clf.predict_proba(test_samples)
new_prob = []
for row in prob:
    # pair classes with this row's probabilities; pad unseen classes with 0
    prob_per_class = (list(zip(clf.classes_, row))
                      + list(zip(classes_not_trained, repeat(0.))))
    # put the probabilities in class order
    prob_per_class = sorted(prob_per_class)
    # append a concrete list (a generator would not convert to a 2-D array)
    new_prob.append([p for _, p in prob_per_class])
new_prob = np.asarray(new_prob)

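new_prob is an [n_samples, n_classes] array, just like the output from predict_proba, except that n_classes now covers all possible classes and the columns for previously unseen classes contain 0 probabilities.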
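With a full-width probability matrix, the original goal of retrieving the most probable labels per observation comes down to an argsort. This is a minimal sketch under the assumption that the columns of `new_prob` are in the same order as the labels passed in; the helper name `top_k_labels` and the sample values are hypothetical.

```python
import numpy as np


def top_k_labels(new_prob, all_classes, k=5):
    """Return the k most probable labels per row, most probable first."""
    all_classes = np.asarray(all_classes)
    # argsort is ascending, so keep the last k columns and reverse them
    top = np.argsort(new_prob, axis=1)[:, -k:][:, ::-1]
    # index the label array with the column indices to get the labels
    return all_classes[top]


# hand-written matrix in place of a real padded predict_proba result
new_prob = np.array([[0.0, 0.2, 0.0, 0.8],
                     [0.0, 0.6, 0.0, 0.4]])
labels = top_k_labels(new_prob, ["a", "b", "c", "d"], k=2)
print(labels)
```

Classes padded with probability 0 can still appear in the result when k is larger than the number of trained classes, so callers may want to filter those out.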

Training a sklearn LogisticRegression classifier without all possible labels