Sklearn SelectFromModel with L1-regularized Logistic Regression

Date: 2020-06-04 12:36:29

Tags: python machine-learning scikit-learn feature-selection

As part of my pipeline, I want to combine feature selection using `LogisticRegression(penalty='l1')`. To choose an appropriate amount of regularization, I optimize the regularization parameter `C` of the `SelectFromModel` estimator with `GridSearchCV`:

```python
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression, LassoCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectFromModel
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
import numpy as np

seed = 111

breast = load_breast_cancer()
X = breast.data
y = breast.target

LR_L1 = LogisticRegression(penalty='l1', random_state=seed, solver='saga', max_iter=1e5)

pipeline = Pipeline([('scale', StandardScaler()),
                     ('SelectFromModel', SelectFromModel(LR_L1)),
                     ('classifier', RandomForestClassifier(n_estimators=500, random_state=seed))])

Lambda = np.array([])
for i in [1e-1, 1, 1e-2, 1e-3]:
    Lambda = np.append(Lambda, i * np.arange(2, 11, 2))

param_grid = {'SelectFromModel__estimator__C': Lambda,
              'classifier_max_features': np.arange(10, 100, 10)}

clf = GridSearchCV(pipeline, param_grid, scoring='roc_auc', n_jobs=7,
                   cv=RepeatedStratifiedKFold(random_state=seed), verbose=1)
clf.fit(X, y)
```

For some values of `C`, I receive the following warning:

UserWarning: No features were selected: either the data is too noisy or the selection test too strict.

This is understandable. However, when I use the same `LogisticRegression` as the classifier rather than for feature selection, with the same training set and the same hyperparameters, I have no problem. Judging from those results, it should be impossible for zero features to have coefficients different from 0.

Is this a bug, or am I misunderstanding something?

1 answer:

Answer 0 (score: 2)

You are getting that warning because the regularization of the `LogisticRegression` is too strong for those values of `C`. There is also a typo in `param_grid`: the `classifier_max_features` parameter should be `classifier__max_features` (two underscores).
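The first point can be illustrated directly. A minimal sketch (not from the answer's notebook; the specific `C` values are chosen for illustration): with a sufficiently small `C`, the L1 penalty shrinks every coefficient to exactly zero, so `SelectFromModel` has nothing to keep.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Smaller C means stronger L1 regularization; at some point every
# coefficient hits zero and no feature passes the selection threshold.
for C in [1e-4, 1e-2, 1e-1]:
    lr = LogisticRegression(penalty='l1', C=C, solver='saga', max_iter=10000)
    selector = SelectFromModel(lr).fit(X, y)
    n_selected = selector.get_support().sum()
    print(f"C={C}: {n_selected} features selected")
```

With the weakest setting in the question's grid (`C = 2e-3`) the selector can end up empty, which is exactly when the `UserWarning` fires.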

With regularization values `C >= 1e-2`, the code works. Here you can find a google colab notebook with an example.
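The double-underscore convention can be verified against the pipeline itself: `Pipeline.get_params()` lists every tunable name as `<step>__<param>`, so a key with a single underscore is simply not a valid grid-search parameter. A minimal sketch (step names match the question's pipeline):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([('scale', StandardScaler()),
                     ('classifier', RandomForestClassifier())])

# Valid keys join the step name and the parameter with two underscores;
# 'classifier_max_features' is not among them, so GridSearchCV rejects it.
params = pipeline.get_params()
print('classifier__max_features' in params)  # True
print('classifier_max_features' in params)   # False
```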

One more note: the dataset is too small for such complex manipulations.