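Is there a built-in brute-force feature selection method in scikit-learn, i.e. one that exhaustively evaluates all possible combinations of the input features and then finds the best subset? I am familiar with the "Recursive Feature Elimination" class, but I am specifically interested in evaluating all possible combinations of the input features one after the other.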
Answer 0 (score: 6)
No, best subset selection is not implemented. The easiest way to do it is to write it yourself. This should get you started:
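from itertools import chain, combinations

import numpy as np
# cross_val_score lived in sklearn.cross_validation in old scikit-learn releases
from sklearn.model_selection import cross_val_score


def best_subset_cv(estimator, X, y, cv=3):
    n_features = X.shape[1]
    # Every non-empty subset of column indices, from size 1 up to n_features
    subsets = chain.from_iterable(combinations(range(n_features), k + 1)
                                  for k in range(n_features))

    best_score = -np.inf
    best_subset = None
    for subset in subsets:
        # X is assumed to be a NumPy array; list(subset) selects those columns
        score = cross_val_score(estimator, X[:, list(subset)], y, cv=cv).mean()
        if score > best_score:
            best_score, best_subset = score, subset

    return best_subset, best_score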
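This performs k-fold cross-validation inside the loop, so it fits about k · 2ᵖ estimators when given data with p features (k(2ᵖ − 1) to be exact, since the empty subset is skipped).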
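To get a feel for how quickly that cost grows, here is a minimal sketch that enumerates the same subsets using only the standard library and counts the fits an exhaustive search would need. The helper names `all_nonempty_subsets` and `count_fits` are made up for this illustration and are not part of scikit-learn.

```python
from itertools import chain, combinations


def all_nonempty_subsets(n_features):
    """Return an iterator over every non-empty subset of column indices."""
    return chain.from_iterable(
        combinations(range(n_features), size)
        for size in range(1, n_features + 1)
    )


def count_fits(n_features, cv):
    """Count the estimator fits needed: one per CV fold per subset."""
    n_subsets = sum(1 for _ in all_nonempty_subsets(n_features))
    return cv * n_subsets


for p in (4, 10, 15):
    print(p, count_fits(p, cv=3))
# 4 45
# 10 3069
# 15 98301
```

Each extra feature roughly doubles the work, so this approach is only practical for a small number of features.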
Answer 1 (score: 1)
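Combining Fred Foo's answer with the comments by nopper, ihadanny and jimijazz, the following code produces the same results as the R function regsubsets() (part of the leaps library) for the first example in Lab 1 (6.5.1 Best Subset Selection) of the book "An Introduction to Statistical Learning with Applications in R".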
from itertools import combinations

import numpy as np
# cross_val_score lived in sklearn.cross_validation in old scikit-learn releases
from sklearn.model_selection import cross_val_score


def best_subset(estimator, X, y, max_size=8, cv=5):
    '''Calculates the best model of up to max_size features of X.
    estimator must have fit and score methods.
    X must be a DataFrame.'''
    n_features = X.shape[1]
    subsets = (combinations(range(n_features), k + 1)
               for k in range(min(n_features, max_size)))

    best_size_subset = []
    for subsets_k in subsets:  # for each list of subsets of the same size
        best_score = -np.inf
        best_subset = None
        for subset in subsets_k:  # for each subset
            estimator.fit(X.iloc[:, list(subset)], y)
            # get the subset with the best score among subsets of the same size
            score = estimator.score(X.iloc[:, list(subset)], y)
            if score > best_score:
                best_score, best_subset = score, subset
        # to compare subsets of different sizes we must use CV
        # first store the best subset of each size
        best_size_subset.append(best_subset)

    # compare best subsets of each size
    best_score = -np.inf
    best_subset = None
    list_scores = []
    for subset in best_size_subset:
        score = cross_val_score(estimator, X.iloc[:, list(subset)], y, cv=cv).mean()
        list_scores.append(score)
        if score > best_score:
            best_score, best_subset = score, subset

    return best_subset, best_score, best_size_subset, list_scores
Answer 2 (score: 0)
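You might want to take a look at MLxtend's Exhaustive Feature Selector. It is obviously not built into scikit-learn (yet?), but it does support scikit-learn's classifier and regressor objects.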