How to bind parameters in a scikit-learn pipeline?

Time: 2018-02-19 12:31:53

Tags: scikit-learn keras

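I have a Pipeline object whose hyperparameters I want to optimize with RandomizedSearchCV, but I need to bind two of its parameters together, in the sense that when one is set to a value, the other is automatically set to the same value.

Here is my specific case: I chain a PCA that reduces the data to nbFeature dimensions with a Keras classifier that has to be told its input dimension, which is also nbFeature. When the two do not match, the fit obviously fails. See the toy example below: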

# setup
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.model_selection import RandomizedSearchCV
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier

# toy data
n = 500
p = 100
X = np.random.normal(size=(n,p))
Y = np.concatenate((np.zeros(int(n/2)),np.ones(int(n/2))))

# toy pipeline
nbFeature = 10 # the guy to bind between the PCA and my Keras model

reducer = PCA(n_components=nbFeature)

def myBasicDense(n_feature):
    return KerasClassifier(build_fn=buildfn_myBasicDense,n_feature=n_feature,verbose=0) 
def buildfn_myBasicDense(n_feature=777):
    model = Sequential()
    # sigmoid, not softmax: softmax over a single output unit always returns 1
    model.add(Dense(1,input_dim=n_feature,activation='sigmoid'))
    model.compile(optimizer='rmsprop',loss='binary_crossentropy',metrics=['accuracy'])
    return model  
model = myBasicDense(n_feature=nbFeature) # tried using 'reducer.n_components' but this only uses the value once, instead of binding

pipeStep = [('reducer',reducer),('model',model)]
pipe = Pipeline(pipeStep)

# run RandomizedSearchCV
# this works only when sampled 'reducer__n_components' and 'model__n_feature' are equal
gridDist = {'reducer__n_components': [10, 50],'model__n_feature': [10, 50]}

n_iter_search = 2
optimizedPipe = RandomizedSearchCV(
        refit=True,        
        estimator=pipe,
        param_distributions=gridDist,
        n_iter=n_iter_search,
        scoring='accuracy',
        cv=3,         
        verbose=2,
        random_state=12 # chosen so that it fails on the second round...
        )

optimizedPipe.fit(X,Y)
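
To make the failure concrete, here is a small sketch in plain Python (no scikit-learn or Keras needed) that enumerates every combination the grid above allows. The names `combos` and `n_consistent` are only illustrative. Because the two parameters are drawn independently, half of the possible combinations pair a PCA output size with a different model input size:

```python
from itertools import product

# Same grid as above: the two parameters vary independently of each other.
gridDist = {'reducer__n_components': [10, 50], 'model__n_feature': [10, 50]}

# Enumerate the Cartesian product of all candidate values.
names = list(gridDist)
combos = [dict(zip(names, values)) for values in product(*gridDist.values())]

for combo in combos:
    consistent = combo['reducer__n_components'] == combo['model__n_feature']
    print(combo, 'consistent' if consistent else 'MISMATCH')

# Count the combinations in which PCA output size equals model input size.
n_consistent = sum(
    c['reducer__n_components'] == c['model__n_feature'] for c in combos
)
print(n_consistent, 'of', len(combos), 'combinations are usable')
```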

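So here is my question: is there a way to specify a pipeline in which two or more parameters must always be equal, so that I only have to search over one of them?

(Alternatively, any workaround is welcome, including a better way of using RandomizedSearchCV.)

Many thanks!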

1 Answer:

Answer 0: (score: 0)

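There are two solutions to your problem:

Update: method (1) only works with GridSearchCV, not with RandomizedSearchCV. For RandomizedSearchCV, use (2) below.

1) Group the parameters together in gridDist.

Instead of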

gridDist = {'reducer__n_components': [10, 50],'model__n_feature': [10, 50]}

you should do this:

gridDist = [{'reducer__n_components': [10],'model__n_feature': [10]},
            {'reducer__n_components': [50],'model__n_feature': [50]}]

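This makes gridDist a list of two dictionaries, and the parameters within one dictionary are always explored together, so n_components and n_feature always take the same value. The scikit-learn user guide section on exhaustive grid search has an example of a parameter grid written this way.

2) Make a wrapper, as I suggested in the comments. Something like this: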
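
As a quick check of the list-of-dictionaries form, the sketch below expands it with `ParameterGrid` from `sklearn.model_selection`, the same helper GridSearchCV uses to enumerate candidates. The variable name `candidates` is only illustrative. Each dictionary is expanded on its own, so no mixed 10/50 combination can appear:

```python
from sklearn.model_selection import ParameterGrid

# One dictionary per allowed setting; values inside a dictionary travel together.
gridDist = [{'reducer__n_components': [10], 'model__n_feature': [10]},
            {'reducer__n_components': [50], 'model__n_feature': [50]}]

# ParameterGrid expands each dictionary separately and concatenates the results.
candidates = list(ParameterGrid(gridDist))
for params in candidates:
    print(params)
```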

# In addition to the imports from the question:
from sklearn.base import BaseEstimator, ClassifierMixin

def myBasicDense(n_feature):
    return KerasClassifier(build_fn=buildfn_myBasicDense, n_feature=n_feature, verbose=0)

def buildfn_myBasicDense(n_feature=777):
    model = Sequential()
    # sigmoid, not softmax: softmax over a single output unit always returns 1
    model.add(Dense(1, input_dim=n_feature, activation='sigmoid'))
    model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
    return model

class CustomWrapper(BaseEstimator, ClassifierMixin):

    def __init__(self, n_features=10):
        self.n_features = n_features
        self.pipe = self._build_pipe()

    def _build_pipe(self):
        # The single n_features value is passed to both parts of the pipeline
        return Pipeline([('reducer', PCA(n_components=self.n_features)),
                         ('model', myBasicDense(n_feature=self.n_features))])

    def fit(self, X, y):
        self.pipe.fit(X, y)
        return self

    def predict(self, X):
        return self.pipe.predict(X)

    def set_params(self, **params):
        # The search calls set_params for every candidate, so rebuild the
        # inner pipeline here to keep both steps in sync with n_features
        super(CustomWrapper, self).set_params(**params)
        self.pipe = self._build_pipe()
        return self

Now you have only one hyperparameter to search over, n_features, so your parameter grid becomes:

gridDist = {'n_features': [10, 50]}

You then initialize the randomized search as follows:

wrapperModel = CustomWrapper()

optimizedPipe = RandomizedSearchCV(
        refit=True,        
        estimator=wrapperModel,
        param_distributions=gridDist,
        n_iter=n_iter_search,
        scoring='accuracy',
        cv=3,         
        verbose=2,
        random_state=12 # same seed as in the question
        )
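
For a version that can be run end to end without Keras, here is a minimal sketch of the same idea. Everything in it is an assumption made for illustration. `DimCheckingClassifier` is a hypothetical stand-in for the Keras model that, like it, refuses input of the wrong width. `TiedWrapper` is a variant of the wrapper above that builds its inner pipeline inside `fit` rather than overriding `set_params`, and the data sizes are smaller than in the question.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline


class DimCheckingClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical stand-in for the Keras model: rejects a wrong input width."""

    def __init__(self, n_feature=10):
        self.n_feature = n_feature

    def fit(self, X, y):
        if X.shape[1] != self.n_feature:
            raise ValueError("expected %d input features, got %d"
                             % (self.n_feature, X.shape[1]))
        self.clf_ = LogisticRegression(max_iter=1000).fit(X, y)
        self.classes_ = self.clf_.classes_
        return self

    def predict(self, X):
        return self.clf_.predict(X)


class TiedWrapper(ClassifierMixin, BaseEstimator):
    """Exposes one parameter and feeds it to both steps of an inner pipeline."""

    def __init__(self, n_features=10):
        self.n_features = n_features

    def fit(self, X, y):
        # Built at fit time, so it always reflects the current n_features.
        self.pipe_ = Pipeline([
            ('reducer', PCA(n_components=self.n_features)),
            ('model', DimCheckingClassifier(n_feature=self.n_features)),
        ])
        self.pipe_.fit(X, y)
        self.classes_ = np.unique(y)
        return self

    def predict(self, X):
        return self.pipe_.predict(X)


# Toy data, smaller than in the question to keep the run short.
rng = np.random.RandomState(0)
X = rng.normal(size=(120, 60))
Y = np.repeat([0, 1], 60)

search = RandomizedSearchCV(
    estimator=TiedWrapper(),
    param_distributions={'n_features': [10, 50]},
    n_iter=2,
    scoring='accuracy',
    cv=3,
    refit=True,
    random_state=12,
)
search.fit(X, Y)
print(search.best_params_)
```

Since the search only ever sets `n_features`, no candidate can pair a PCA output size with a different model input size. After fitting, `search.best_estimator_.pipe_` holds the refitted inner pipeline.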