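I have a pipeline object whose hyperparameters I want to optimize with RandomizedSearchCV, but I need to bind two parameters, in the sense that when one is set to a value, the other is automatically set to the same value.
Here is my specific case: I chain a PCA that reduces the data to nbFeature dimensions with a Keras classifier that must be told its input dimension, nbFeature. When the two do not match, it obviously fails. See the toy example below: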
# setup
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.model_selection import RandomizedSearchCV
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
# toy data
n = 500
p = 100
X = np.random.normal(size=(n,p))
Y = np.concatenate((np.zeros(int(n/2)),np.ones(int(n/2))))
# toy pipeline
nbFeature = 10 # the guy to bind between the PCA and my Keras model
reducer = PCA(n_components=nbFeature)
def myBasicDense(n_feature):
    return KerasClassifier(build_fn=buildfn_myBasicDense,n_feature=n_feature,verbose=0)
def buildfn_myBasicDense(n_feature=777):
    model = Sequential()
    model.add(Dense(1,input_dim=n_feature,activation='softmax'))
    model.compile(optimizer='rmsprop',loss='binary_crossentropy',metrics=['accuracy'])
    return model
model = myBasicDense(n_feature=nbFeature) # tried using 'reducer.n_components' but this only uses the value once, instead of binding
pipeStep = [('reducer',reducer),('model',model)]
pipe = Pipeline(pipeStep)
# run RandomizedSearchCV
# this works only when sampled 'reducer__n_components' and 'model__n_feature' are equal
gridDist = {'reducer__n_components': [10, 50],'model__n_feature': [10, 50]}
n_iter_search = 2
optimizedPipe = RandomizedSearchCV(
    refit=True,
    estimator=pipe,
    param_distributions=gridDist,
    n_iter=n_iter_search,
    scoring='accuracy',
    cv=3,
    verbose=2,
    random_state=12 # chosen so that it fails on the second round...
)
optimizedPipe.fit(X,Y)
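So here is my question: is there a way to specify a pipeline in which two or more parameters must always be equal, so that I only have to search over one of them?
(Alternatively, any workaround is welcome, including a better use of RandomizedSearchCV.)
Many thanks!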
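Answer 0 (score: 0)
There are two solutions to your problem:
Update: this method only works with GridSearchCV, not with RandomizedSearchCV. Use (2) below instead.
1) Group the parameters together in gridDist.
Instead of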
gridDist = {'reducer__n_components': [10, 50],'model__n_feature': [10, 50]}
you should do this:
gridDist = [{'reducer__n_components': [10],'model__n_feature': [10]},
{'reducer__n_components': [50],'model__n_feature': [50]}]
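This turns the grid into a list of two dictionaries, and the parameters inside one dictionary are always explored together, so n_components and n_feature will always have the same value. See this example for a better use of parameter grids like this.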
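The grouping in (1) can also be generated instead of written by hand. The following is only a minimal sketch using the standard library: the helper name tied_param_grid is hypothetical and not part of scikit-learn, and the resulting list is meant to be passed as param_grid to GridSearchCV.

```python
def tied_param_grid(names, values):
    """Build one single-valued dict per value, setting every name to that value.

    Hypothetical helper for illustration only; it is not a scikit-learn function.
    """
    return [{name: [value] for name in names} for value in values]


# Parameter names are taken from the pipeline in the question.
gridDist = tied_param_grid(
    names=['reducer__n_components', 'model__n_feature'],
    values=[10, 50],
)

# Each entry keeps the two parameters tied to the same value.
for entry in gridDist:
    print(entry)
```

Because every dictionary holds exactly one value per parameter, a grid search over this list never mixes a PCA size with a different input dimension.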
2) Make a wrapper, as I suggested in the comments. Something like this:
# needed in addition to the imports in the question's setup
from sklearn.base import BaseEstimator, ClassifierMixin

def myBasicDense(n_feature):
    return KerasClassifier(build_fn=buildfn_myBasicDense, n_feature=n_feature, verbose=0)

def buildfn_myBasicDense(n_feature=777):
    model = Sequential()
    # sigmoid instead of the question's softmax: softmax over a single unit always outputs 1
    model.add(Dense(1, input_dim=n_feature, activation='sigmoid'))
    model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
    return model

class CustomWrapper(BaseEstimator, ClassifierMixin):
    def __init__(self, n_features=10):
        self.n_features = n_features
        # This n_features is passed to both parts of the pipeline
        self.pipe = Pipeline([('reducer', PCA(n_components=n_features)), ('model', myBasicDense(n_feature=n_features))])

    def fit(self, X, y):
        self.pipe.fit(X, y)
        return self

    def predict(self, X):
        return self.pipe.predict(X)

    def set_params(self, **params):
        super(CustomWrapper, self).set_params(**params)
        # Rebuild the inner pipeline so that both steps pick up the new n_features
        self.pipe = Pipeline([('reducer', PCA(n_components=self.n_features)), ('model', myBasicDense(n_feature=self.n_features))])
        return self
Now you have only one hyperparameter to search over: n_features. So your parameter grid becomes:
gridDist = {'n_features': [10, 50]}
You then initialize the randomized search as follows:
wrapperModel = CustomWrapper()
optimizedPipe = RandomizedSearchCV(
    refit=True,
    estimator=wrapperModel,
    param_distributions=gridDist,
    n_iter=n_iter_search,
    scoring='accuracy',
    cv=3,
    verbose=2,
    random_state=12 # same seed as in the question
)
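The idea behind the wrapper, one sampled value driving both settings, can also be illustrated without scikit-learn. The sketch below is an illustration under stated assumptions rather than part of the method above: sample_tied_params is a hypothetical helper that draws the shared value with the standard library's random module and expands it into both pipeline parameter names from the question.

```python
import random


def sample_tied_params(names, values, n_iter, seed=None):
    """Draw n_iter candidates; in each one, every name gets the same sampled value.

    Hypothetical helper for illustration only; it is not a scikit-learn function.
    """
    rng = random.Random(seed)
    candidates = []
    for _ in range(n_iter):
        value = rng.choice(values)  # one draw shared by all tied parameters
        candidates.append({name: value for name in names})
    return candidates


candidates = sample_tied_params(
    names=['reducer__n_components', 'model__n_feature'],
    values=[10, 50],
    n_iter=4,
    seed=12,
)

for params in candidates:
    print(params)
```

Each dictionary could then be applied to the original pipeline with pipe.set_params(**params) and scored with, for example, cross_val_score. That gives up the conveniences of RandomizedSearchCV (refit, best_params_) in exchange for not needing a wrapper class.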