Question

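So I found that StandardScaler() makes my RFECV inside GridSearchCV, with nested 3-fold cross-validation, run faster. Without StandardScaler(), my code ran for more than 2 days, so I cancelled it and decided to inject StandardScaler into the process. But now it has been running for more than four hours and I'm not sure I've done it right. Here is my code: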

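# Imports needed by the snippet
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Choose Linear SVM as classifier
LSVM = SVC(kernel='linear')

# Recursive feature elimination, one feature per step, scored by 3-fold CV
selector = RFECV(LSVM, step=1, cv=3, scoring='f1')

# Candidate values of C for the SVC wrapped by RFECV
param_grid = [{'estimator__C': [0.001, 0.01, 0.1, 1, 10, 100]}]

clf = make_pipeline(StandardScaler(),
                    GridSearchCV(selector,
                                 param_grid,
                                 cv=3,
                                 refit=True,
                                 scoring='f1'))

clf.fit(X, Y)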

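Honestly, I think I've got it wrong, because I believe StandardScaler() should go inside GridSearchCV() so that the data is standardized for each fold rather than just once (?). Please correct me if I'm wrong or if my pipeline is incorrect, and tell me why it still takes so long to run.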
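One possible way to get per-fold standardization is to wrap the scaler and the classifier in a single Pipeline and hand that to RFECV. The sketch below is not taken from the post: it runs on small synthetic data, the step names 'scale' and 'clf' are illustrative, and it assumes scikit-learn 0.24 or later, where RFECV accepts `importance_getter`.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Small synthetic stand-in for the real data (assumption, for speed).
X_demo, y_demo = make_classification(n_samples=150, n_features=8,
                                     n_informative=4, random_state=0)

# Scaler and classifier travel together, so every fit inside RFECV
# re-standardizes whatever training rows and feature subset it receives.
scaled_svm = Pipeline([('scale', StandardScaler()),
                       ('clf', SVC(kernel='linear'))])

# importance_getter tells RFECV where the coefficients live in the pipeline.
selector = RFECV(scaled_svm, step=1, cv=3, scoring='f1',
                 importance_getter='named_steps.clf.coef_')

# Nested parameter name: RFECV.estimator -> step 'clf' -> parameter C.
search = GridSearchCV(selector, {'estimator__clf__C': [0.1, 1]},
                      cv=3, scoring='f1')
search.fit(X_demo, y_demo)

print(search.best_params_, search.best_estimator_.n_features_)
```

With this layout the scaler is never fit on rows held out by either cross-validation loop.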

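I have 8,000 rows with 145 features to be pruned by RFECV, and 6 values of C to be searched by GridSearchCV. So for each value of C, the best feature set is determined by RFECV.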
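A rough count of model fits may help explain the run time. The helper below is a hypothetical back-of-the-envelope estimate, not a measurement: it assumes one fit per surviving feature count in each RFE pass and ignores the final refits done by RFECV and GridSearchCV.

```python
def rough_fit_count(n_features, n_c_values, outer_folds, inner_folds, step=1):
    """Approximate number of estimator fits for RFECV nested in a grid search."""
    # One RFE pass fits once per feature count (145, 144, ..., 1 when step=1).
    fits_per_rfe_pass = -(-n_features // step)  # ceiling division
    # RFECV repeats that pass on each of its inner folds.
    fits_per_rfecv = inner_folds * fits_per_rfe_pass
    # GridSearchCV fits one RFECV per C value per outer fold.
    return n_c_values * outer_folds * fits_per_rfecv


total = rough_fit_count(n_features=145, n_c_values=6, outer_folds=3, inner_folds=3)
print(total)  # 7830
```

Under these assumptions the search trains a linear-kernel SVC several thousand times, each on a few thousand rows, so long run times are plausible even with scaled data.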

谢谢！

更新：

So I put the StandardScaler inside the RFECV, like this:

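from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

clf = SVC(kernel='linear')

kf = KFold(n_splits=3, shuffle=True, random_state=0)

estimators = [('standardize', StandardScaler()),
              ('clf', clf)]

# Pipeline subclass that exposes the final step's coefficients to RFECV
class Mypipeline(Pipeline):
    @property
    def coef_(self):
        return self._final_estimator.coef_
    @property
    def feature_importances_(self):
        return self._final_estimator.feature_importances_

pipeline = Mypipeline(estimators)
rfecv = RFECV(estimator=pipeline, cv=kf, scoring='f1', verbose=10)

param_grid = [{'estimator__svc__C': [0.001, 0.01, 0.1, 1, 10, 100]}]

clf = GridSearchCV(rfecv, param_grid, cv=3, scoring='f1', verbose=10)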

But it still throws the following error:

ValueError: Invalid parameter C for estimator Pipeline(memory=None, steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('svc', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False))]). Check the list of available parameters with `estimator.get_params().keys()`.
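The error message suggests checking `estimator.get_params().keys()`. A minimal sketch of that check follows, assuming the step names 'standardize' and 'clf' from the snippet above. The traceback shows 'standardscaler' and 'svc', so the failing run may have used different names.

```python
from sklearn.feature_selection import RFECV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipeline = Pipeline([('standardize', StandardScaler()),
                     ('clf', SVC(kernel='linear'))])
rfecv = RFECV(estimator=pipeline, cv=3, scoring='f1')

# Nested names follow <attribute>__<step name>__<parameter>.
c_params = sorted(k for k in rfecv.get_params().keys() if k.endswith('__C'))
print(c_params)  # ['estimator__clf__C']
```

The key in `param_grid` has to match one of the printed names exactly, so with these step names it would be `estimator__clf__C` rather than `estimator__svc__C` or `estimator__C`.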

Answer 1

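Kumar is right. In addition, you may want to turn on verbose in GridSearchCV. You can also put a limit on the number of iterations for the SVC, starting from a very small number (like 5), to make sure the problem isn't with convergence.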
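The two suggestions above might look like this in practice. It is only a sketch on synthetic data with a shortened C grid, not the asker's full setup. `verbose` and `max_iter` are existing parameters of GridSearchCV and SVC respectively.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the real data (assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Cap the solver at 5 iterations; a ConvergenceWarning during fitting
# means the cap was reached before the solver converged.
svc = SVC(kernel='linear', max_iter=5)

# verbose=2 prints one line per fit, so progress is visible while it runs.
search = GridSearchCV(svc, {'C': [0.1, 1, 10]}, cv=3, scoring='f1', verbose=2)
search.fit(X_demo, y_demo)

print(search.best_params_)
```

If the capped run finishes quickly while the uncapped one does not, slow convergence is a likely culprit, and `max_iter` can then be raised step by step.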

Stacking StandardScaler() with RFECV and GridSearchCV
