Question

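This question and answer demonstrate that, when feature selection is performed with one of scikit-learn's dedicated feature selection routines, the names of the selected features can be retrieved as follows: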

np.asarray(vectorizer.get_feature_names())[featureSelector.get_support()]

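For example, in the code above, featureSelector might be an instance of sklearn.feature_selection.SelectKBest or sklearn.feature_selection.SelectPercentile, since these classes implement the get_support method, which returns either a boolean mask or the integer indices of the selected features.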
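To make the two return forms of get_support concrete, here is a minimal sketch. The toy corpus, the labels, the chi2 scoring function and k=2 are all assumptions made up for illustration; the getattr fallback is only there because newer scikit-learn releases replaced get_feature_names with get_feature_names_out.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Toy corpus and labels, invented purely for illustration.
docs = ["good great film", "great acting good plot",
        "bad boring film", "boring bad plot"]
y = [1, 1, 0, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Keep the 2 highest-scoring features (k=2 is an arbitrary choice here).
featureSelector = SelectKBest(chi2, k=2).fit(X, y)

# get_feature_names() is the method used in the question; newer releases
# expose get_feature_names_out() instead, so use whichever is available.
get_names = getattr(vectorizer, "get_feature_names_out", None) or vectorizer.get_feature_names
names = np.asarray(get_names())

mask = featureSelector.get_support()                 # boolean mask, one entry per feature
indices = featureSelector.get_support(indices=True)  # integer indices of the kept features

selected_by_mask = names[mask]
selected_by_index = names[indices]
print(selected_by_mask)
```

Both indexing styles select the same names; the boolean mask has one entry per vocabulary term, while the index form lists only the kept columns.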

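If one performs feature selection via linear models penalized with the L1 norm, it is unclear how to accomplish this. sklearn.svm.LinearSVC has no get_support method, and the documentation does not make clear how to retrieve the feature indices after using its transform method to eliminate features from a collection of samples. Am I missing something here?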

Answer 1

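For sparse estimators you can generally find the support by checking where the non-zero entries are in the coefficient vector (provided a coefficient vector exists, which is the case for linear models, for example):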

support = np.flatnonzero(estimator.coef_)
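As a small illustration of what np.flatnonzero returns, the sketch below uses a hand-written coefficient vector and made-up feature names (both hypothetical, not taken from a fitted estimator):

```python
import numpy as np

# Hypothetical sparse coefficient vector, standing in for estimator.coef_.
coef = np.array([0.0, 1.3, 0.0, -0.7, 0.0])

# Hypothetical feature names, standing in for the vectorizer's vocabulary.
feature_names = np.asarray(["a", "b", "c", "d", "e"])

# Indices of the non-zero coefficients, i.e. the support.
support = np.flatnonzero(coef)

# Index the name array with the support to get the selected names.
selected = feature_names[support]
print(support, selected)
```

Here the support is the pair of positions 1 and 3, so the selected names are "b" and "d".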

For your LinearSVC with an l1 penalty it would accordingly be:

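import numpy as np
from sklearn.svm import LinearSVC

# The L1 penalty drives many coefficients to exactly zero; it requires dual=False.
svc = LinearSVC(C=1., penalty='l1', dual=False)
svc.fit(X, y)
# Keep the names of the features whose coefficient is non-zero.
selected_feature_names = np.asarray(vectorizer.get_feature_names())[np.flatnonzero(svc.coef_)]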

Answer 2

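I have been using sklearn 0.15.2. According to the LinearSVC documentation, coef_ is an array of shape [n_features] if n_classes == 2, else [n_classes, n_features]. First, np.flatnonzero does not work for the multi-class case: you will get an index-out-of-range error. Second, it should be np.where(svc.coef_ != 0)[1] rather than np.where(svc.coef_ != 0)[0], because element 0 holds the class indices, not the feature indices. I ended up using np.asarray(vectorizer.get_feature_names())[list(set(np.where(svc.coef_ != 0)[1]))]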
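The multi-class point can be checked with plain NumPy. The sketch below uses a hand-written 3-class, 4-feature coefficient matrix and made-up feature names (both hypothetical), and uses np.unique as a swapped-in alternative to list(set(...)):

```python
import numpy as np

# Hypothetical coef_ for 3 classes (rows) and 4 features (columns).
coef = np.array([[0.0, 0.5, 0.0,  0.0],
                 [0.0, 0.0, 0.0, -1.2],
                 [0.0, 0.3, 0.0,  0.0]])
feature_names = np.asarray(["w0", "w1", "w2", "w3"])

# flatnonzero indexes the flattened 3x4 array, so it can return values
# up to 11, which are out of range for a 4-element name array.
flat = np.flatnonzero(coef)

# np.where on a 2-D array returns (row indices, column indices);
# the columns are the feature indices.
class_idx, feature_idx = np.where(coef != 0)

# A feature is selected if any class gives it a non-zero weight.
support = np.unique(feature_idx)
selected = feature_names[support]
print(flat, support, selected)
```

np.unique removes duplicates like set does, but also returns the indices sorted, so the selected names keep the vocabulary order. A boolean mask built with np.any(coef != 0, axis=0) would give the same selection.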

sklearn: getting feature names after L1-based feature selection
