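I'm trying to combine the sequential feature selector (from mlxtend) with GridSearchCV (from sklearn).
My goal is to run forward feature selection for each set of parameters, to find out which combination of parameters and features yields the best score. The code below is based on Example 8 of the mlxtend user guide (see http://rasbt.github.io/mlxtend/user_guide/feature_selection/SequentialFeatureSelector/#example-8-sequential-feature-selection-and-gridsearch)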
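from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

X = data.values  # dataframe with 48 features and 200 rows
y = diags        # binary classification for each row
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Named svc so the estimator instance does not shadow an imported svm module
svc = SVC()

# Forward selection over 1..48 features, scored by 5-fold cross-validated F1
sfs = SFS(estimator=svc,
          k_features=(1, len(data.columns)),
          forward=True,
          floating=False,
          scoring='f1',
          cv=5)

pipe = Pipeline([
    ('sfs', sfs),
    ('svm', svc)
])

# All keys are prefixed with sfs__estimator__, so they address the estimator inside SFS
param_grid = [
    {
        'sfs__estimator__C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
        'sfs__estimator__gamma': ['auto', 'scale', 0.001, 0.0001],
        'sfs__estimator__kernel': ['linear', 'rbf'],
    }
]

gs = GridSearchCV(estimator=pipe,
                  param_grid=param_grid,
                  scoring='f1',
                  n_jobs=1,
                  cv=5,
                  refit=True)
gs.fit(X_train, y_train)

print("Best parameters via GridSearch", gs.best_params_)
# named_steps['sfs'] is the same object as steps[0][1]
print("\nBest features:\n", gs.best_estimator_.named_steps['sfs'].k_feature_idx_)
print("\nBest score:\n", gs.best_estimator_.named_steps['sfs'].k_score_)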
When I run this code, I get the following output:
Best parameters via GridSearch {'sfs__estimator__gamma': 'auto', 'sfs__estimator__kernel': 'linear', 'sfs__estimator__C': 0.001}
Best features:
(16, 39)
Best score:
0.31333333333333335
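I doubt these are the best results, because I did get better results when I set a higher minimum number of features.
I also noticed that changing the order of the values in the parameter grid changes the result. With: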
param_grid = [
    {
        # Notice the change of order for C, 0.01 is now first
        'sfs__estimator__C': [0.01, 0.001, 0.1, 1, 10, 100, 1000],
        'sfs__estimator__gamma': ['auto', 'scale', 0.001, 0.0001],
        'sfs__estimator__kernel': ['linear', 'rbf'],
    }
]
I get the following results:
Best parameters via GridSearch {'sfs__estimator__gamma': 'auto', 'sfs__estimator__kernel': 'linear', 'sfs__estimator__C': 0.01}
Best features:
(16, 39)
Best score:
0.4428571428571429
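The best parameters always seem to be the first value listed for each parameter. I tried other combinations and parameters, but the first combination is always the one returned, and the best features never change.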
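One pattern that could produce this "first value wins" behaviour is a tie. As far as I can tell, when several candidates have exactly the same mean cross-validation score, GridSearchCV reports the first of them in grid order. The sketch below is only an illustration under assumptions: it uses synthetic data from make_classification (not my dataset) and a bare SVC with a linear kernel, for which gamma is not used, so the gamma candidates tie by construction. The names X_demo, y_demo and best_gamma are made up for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: NOT the 200x48 dataset from the question).
X_demo, y_demo = make_classification(n_samples=120, n_features=6, random_state=0)


def best_gamma(gamma_values):
    """Grid-search gamma for a linear-kernel SVC.

    Returns the reported best gamma and the mean CV score of every candidate.
    """
    search = GridSearchCV(
        estimator=SVC(kernel="linear"),  # the linear kernel does not use gamma
        param_grid={"gamma": gamma_values},
        scoring="f1",
        cv=5,
    )
    search.fit(X_demo, y_demo)
    return search.best_params_["gamma"], search.cv_results_["mean_test_score"]


# Same two candidates, listed in opposite orders.
best_a, scores_a = best_gamma([0.001, 0.0001])
best_b, scores_b = best_gamma([0.0001, 0.001])

print("order A -> best gamma:", best_a, "mean scores:", scores_a)
print("order B -> best gamma:", best_b, "mean scores:", scores_b)
print("score spread in A:", scores_a.max() - scores_a.min())
```

If the candidates in my real grid tie in the same way, the reported "best" combination would simply follow the order of the lists, which would match what I am seeing. Whether that is actually what happens with the SFS pipeline is something I have not confirmed.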
Am I using GridSearchCV incorrectly, or am I printing the wrong attributes?
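To narrow down the attribute question, it may help to print what GridSearchCV itself scored for every candidate, and to check which estimator the grid keys actually reach. The following is a minimal sketch under stated assumptions: it uses synthetic data, and it swaps in scikit-learn's own SequentialFeatureSelector as a stand-in for mlxtend's SFS, so there is no k_score_ or k_feature_idx_ here. The names pipe_demo and gs_demo are hypothetical. The grid key mirrors mine (sfs__estimator__C).

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: NOT the dataset from the question).
X_demo, y_demo = make_classification(
    n_samples=120, n_features=6, n_informative=3, random_state=0)

# scikit-learn's selector used as a stand-in for mlxtend's SFS (assumption).
pipe_demo = Pipeline([
    ("sfs", SequentialFeatureSelector(
        SVC(kernel="linear"),
        n_features_to_select=2,
        direction="forward",
        scoring="f1",
        cv=3)),
    ("svm", SVC()),  # final classifier, left at its default parameters
])

gs_demo = GridSearchCV(
    pipe_demo,
    param_grid={"sfs__estimator__C": [0.01, 1, 100]},
    scoring="f1",
    cv=3)
gs_demo.fit(X_demo, y_demo)

# One row per candidate: rank, mean CV score of the whole pipeline, parameters.
results = gs_demo.cv_results_
for params, mean, rank in zip(results["params"],
                              results["mean_test_score"],
                              results["rank_test_score"]):
    print(rank, round(mean, 4), params)

# The score GridSearchCV used to pick the winner (whole pipeline, outer CV).
print("best_score_:", gs_demo.best_score_)

# The grid key reaches the estimator inside the selector, not the final step.
final_svm = gs_demo.best_estimator_.named_steps["svm"]
inner_svm = gs_demo.best_estimator_.named_steps["sfs"].estimator
print("final svm C:", final_svm.C, "| selector's inner C:", inner_svm.C)
```

In this sketch the final svm step keeps its default C, because keys of the form sfs__estimator__... only address the estimator nested inside the selector. It also separates the two scores: gs.best_score_ is the pipeline's outer cross-validation score that GridSearchCV ranks by, while the k_score_ I printed is, as I understand it, the selector's own internal cross-validation score.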