Combining sequential feature selection with grid search

Date: 2018-11-22 11:35:34

Tags: python scikit-learn svm grid-search mlxtend

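I am trying to combine the SequentialFeatureSelector (from mlxtend) with GridSearchCV (from sklearn).

My goal is to run forward feature selection for each set of parameters, to find out which combination of parameters and features gives the best score. The code below is based on Example 8 of the mlxtend user guide (see http://rasbt.github.io/mlxtend/user_guide/feature_selection/SequentialFeatureSelector/#example-8-sequential-feature-selection-and-gridsearch):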

from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X = data.values  # DataFrame with 48 features and 200 rows
y = diags        # binary class label for each row

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Named svc so the estimator does not shadow the sklearn.svm module
svc = SVC()

# Forward selection, free to keep anywhere from 1 to all features
sfs = SFS(estimator=svc,
          k_features=(1, len(data.columns)),
          forward=True,
          floating=False,
          scoring='f1',
          cv=5)

pipe = Pipeline([
    ('sfs', sfs),
    ('svm', svc)
])

# Only the estimator inside the selector is tuned here
param_grid = [
    {
     'sfs__estimator__C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
     'sfs__estimator__gamma': ['auto', 'scale', 0.001, 0.0001],
     'sfs__estimator__kernel': ['linear', 'rbf'],
     }
]

gs = GridSearchCV(estimator=pipe,
                  param_grid=param_grid,
                  scoring='f1',
                  n_jobs=1,
                  cv=5,
                  refit=True)

gs.fit(X_train, y_train)

print("Best parameters via GridSearch", gs.best_params_)

# Attributes of the refitted selector (first pipeline step)
print("\nBest features:\n", gs.best_estimator_.steps[0][1].k_feature_idx_)
print("\nBest score:\n", gs.best_estimator_.steps[0][1].k_score_)

Running this code gives the following output:

Best parameters via GridSearch {'sfs__estimator__gamma': 'auto', 'sfs__estimator__kernel': 'linear', 'sfs__estimator__C': 0.001}

Best features:
 (16, 39)

Best score:
 0.31333333333333335

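I doubt these are the best results, because I did get better scores when I set a higher minimum number of features.

I also noticed that changing the order of the values in the parameter grid changes the result. With this grid: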

param_grid = [
    {
     # Notice the change of order for C, 0.01 is now first
     'sfs__estimator__C': [0.01, 0.001, 0.1, 1, 10, 100, 1000],
     'sfs__estimator__gamma': ['auto', 'scale', 0.001, 0.0001],
     'sfs__estimator__kernel': ['linear', 'rbf'],
     }
]

I get the following output:

Best parameters via GridSearch {'sfs__estimator__gamma': 'auto', 'sfs__estimator__kernel': 'linear', 'sfs__estimator__C': 0.01}

Best features:
 (16, 39)

Best score:
 0.4428571428571429

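The best parameters always seem to be the first value listed for each parameter. I tried other combinations and parameters, but the result is always the first combination, and the best features never change.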
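
One possible explanation, offered as something to check rather than a confirmed diagnosis: when every candidate in the grid gets the same mean score, GridSearchCV reports the first candidate as the best, so reordering the grid changes the reported best_params_. The sketch below reproduces that tie behaviour with scikit-learn alone. The synthetic data from make_classification, the DummyClassifier, and the random_state grid are stand-ins that are not part of the original code; they are chosen only so that every candidate scores an F1 of 0.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real data: 200 rows, 48 features, imbalanced classes
X, y = make_classification(n_samples=200, n_features=48,
                           weights=[0.8, 0.2], random_state=0)

# Always predicts the majority class, so F1 for the positive class is 0
clf = DummyClassifier(strategy="most_frequent")

# random_state has no effect on this strategy, so all candidates tie
param_grid = {"random_state": [3, 1, 2]}

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # hide the undefined-precision warnings
    gs = GridSearchCV(clf, param_grid, scoring="f1", cv=5).fit(X, y)

# Per-candidate scores and ranks: identical scores all receive rank 1
print(gs.cv_results_["mean_test_score"])
print(gs.cv_results_["rank_test_score"])

# With a tie, the first candidate in the grid is reported as best
print(gs.best_index_, gs.best_params_)
```

If printing gs.cv_results_["mean_test_score"] for the real pipeline shows identical values for all candidates, the same tie-breaking would be at work. Note also that gs.best_score_ is the grid search score, while k_score_ is the selector's own cross-validation score from the refit.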

Am I using GridSearchCV incorrectly, or am I printing the wrong attributes?
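
A related check on the parameter names: the grid above only uses keys that start with sfs__estimator__, which address the estimator inside the selector. The sketch below suggests that such a key leaves the final 'svm' step at its defaults once the pipeline has been cloned, which is what GridSearchCV does for each candidate. It uses scikit-learn's own SequentialFeatureSelector as a stand-in for the mlxtend class (an assumption made to keep the example self-contained); both expose the wrapped model as the estimator parameter, so the key names match.

```python
from sklearn.base import clone
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

svc = SVC()

# Stand-in for mlxtend's SFS; nothing is fitted in this sketch
sfs = SequentialFeatureSelector(svc, n_features_to_select=2,
                                direction="forward", scoring="f1", cv=5)

pipe = Pipeline([("sfs", sfs), ("svm", svc)])

# Mimic one grid-search candidate: clone the pipeline, then set a grid key
candidate = clone(pipe).set_params(sfs__estimator__C=0.01)

inner_C = candidate.named_steps["sfs"].estimator.C  # estimator used for selection
final_C = candidate.named_steps["svm"].C            # estimator used for final scoring

print("C inside the selector:", inner_C)
print("C of the final step:  ", final_C)
```

If the final classifier should be tuned as well, keys such as svm__C address the last pipeline step.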

0 answers:

No answers yet.