As part of my pipeline, I want to combine SelectFromModel with LogisticRegression(penalty='l1') for feature selection. To choose a suitable amount of regularization, I optimize the regularization parameter C with GridSearchCV.
Here is the code:

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression, LassoCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectFromModel
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
import numpy as np

seed = 111
breast = load_breast_cancer()
X = breast.data
y = breast.target

LR_L1 = LogisticRegression(penalty='l1', random_state=seed, solver='saga', max_iter=1e5)
pipeline = Pipeline([('scale', StandardScaler()),
                     ('SelectFromModel', SelectFromModel(LR_L1)),
                     ('classifier', RandomForestClassifier(n_estimators=500, random_state=seed))])

Lambda = np.array([])
for i in [1e-1, 1, 1e-2, 1e-3]:
    Lambda = np.append(Lambda, i * np.arange(2, 11, 2))

param_grid = {'SelectFromModel__estimator__C': Lambda,
              'classifier_max_features': np.arange(10, 100, 10)}

clf = GridSearchCV(pipeline, param_grid, scoring='roc_auc', n_jobs=7,
                   cv=RepeatedStratifiedKFold(random_state=seed), verbose=1)
clf.fit(X, y)
For certain values of C, I receive the following warning:

UserWarning: No features were selected: either the data is too noisy or the selection test too strict.

This is understandable. However, when I use the same LogisticRegression as the classifier instead of for feature selection, with the same training set and the same hyperparameters to fit the algorithm, I have no problems. Judging from those results, it cannot be that 0 features have coefficients different from 0.

Is this a bug, or am I misunderstanding something?
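One way to check the claim directly (a minimal sketch, not part of the original question; the value C=0.1 is an illustrative choice) is to fit the same l1-penalized model on the scaled data, as SelectFromModel would, and count the nonzero coefficients:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Fit the l1-penalized model at one candidate C and count how many
# coefficients survive the regularization; SelectFromModel keeps
# features whose coefficients are (effectively) nonzero.
lr = LogisticRegression(penalty='l1', solver='saga', C=0.1, max_iter=10000)
lr.fit(X_scaled, y)
n_selected = np.count_nonzero(lr.coef_)
print(n_selected)
```

If n_selected is 0 for some C in the grid, SelectFromModel will emit exactly the warning above for that candidate.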
Answer 0 (score: 2)
You are getting this warning because the regularization of the LogisticRegression is too strong: for very small C, the l1 penalty shrinks every coefficient to exactly zero, so SelectFromModel has no features to keep. There is also a typo in your param_grid: the classifier_max_features key should be classifier__max_features (two underscores).

With regularization values C >= 1e-2, the code works. Here you can find a Google Colab notebook with an example.

One more note: the dataset is too small for such a complicated setup.
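Putting both fixes together, the grid definition could look like the sketch below (using the variable names from the question; dropping the 1e-3 scale keeps every C at or above 1e-2, per the point above):

```python
import numpy as np

# Build the C grid from scales >= 1e-2 so the l1 penalty never zeroes
# out every coefficient.
Lambda = np.array([])
for i in [1e-1, 1, 1e-2]:
    Lambda = np.append(Lambda, i * np.arange(2, 11, 2))

# Corrected key: two underscores between step name and parameter,
# which is how Pipeline/GridSearchCV route nested parameters.
param_grid = {'SelectFromModel__estimator__C': Lambda,
              'classifier__max_features': np.arange(10, 100, 10)}
print(sorted(Lambda))
```

The double underscore is the separator GridSearchCV uses to address a parameter of a named step (and, via estimator__, a parameter of the estimator wrapped inside SelectFromModel), so classifier_max_features with a single underscore is silently an unknown parameter name.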