How to tune the `gamma` parameter of a precomputed RBF kernel in Python using `GridSearchCV` and `Pipeline`?

Time: 2017-10-10 17:33:12

Tags: python kernel svm grid-search hyperparameters

I am trying to tune the gamma parameter of a precomputed RBF kernel using GridSearchCV() and Pipeline in scikit-learn. I followed the instructions in the following two StackOverflow links:

  1. Is it possible to tune parameters with grid search for custom kernels in scikit-learn?
  2. how to tune parameters of custom kernel function with pipeline in scikit-learn

However, these two links show examples that use sklearn's built-in chi2_kernel and rbf_kernel functions, whereas I am interested in writing my own Gram matrix kernel, as in my minimal working example below.

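Note that, because of the complexity of my original problem, I deliberately define the Train and Test sets inside the body of def main(); in the real code I will have a for loop there that loads multiple datasets from a directory to solve binary one-vs-one classification problems. I therefore want to keep the Train and Test datasets in the main function body. I also have to compute the Gram matrices G_Train and G_Test separately (not in a single step), as I do in my example code.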

You can replace my dummy dataset with Iris or any other dataset.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV, train_test_split
    from scipy.spatial.distance import cdist
    from sklearn.pipeline import Pipeline
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.model_selection import ParameterGrid
    import sklearn
    import sys
    
    class myKernel(BaseEstimator, TransformerMixin):
        def __init__(self, Train, Test, gamma=1.0):
            super(myKernel,self).__init__()
            self.gamma = gamma
            self.Train = Train
            self.Test = Test
    
        def fit(self, **fit_params):
            return self
    
        def transform(self):
            gamma = self.gamma
            Train = self.Train
            Test = self.Test        
    
            G_Train = np.exp(-gamma*np.square(cdist(Train,Train, 'euclidean')))
            G_Test = np.exp(-gamma*np.square(cdist(Test, Train, 'euclidean'))) 
            return G_Train, G_Test
    
    def main():   
    
        print('python: {}'.format(sys.version))
        print('numpy: {}'.format(np.__version__))
        print('sklearn: {}'.format(sklearn.__version__))
        print()
        np.random.seed(0)
    
        Train = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15]])
        Test = np.array([[4, 5, 6],[0, 1, 0], [1, 2, 1], [0, 4, 1]])
    
        Train_label = [1, 1, 1, 0, 0]
        Test_label = [0, 0, 1, 1]
    
        my_kernel = myKernel(Train, Test)
        svm = SVC(kernel='precomputed')
        pipe = Pipeline(steps=[('svm', svm)])
    
        p = [{'svm__C': [[1, 10]], 'svm__gamma': [[0.01, 0.1]]}]  
        parameter = ParameterGrid(p)  
        parameter = np.ravel(parameter)
    
        clf = GridSearchCV(pipe, parameter, n_jobs=-1, cv=2, refit='True')
    
        G_Train, G_Test = my_kernel.transform() 
    
        print(clf.fit(G_Train, Train_label))
    
        #Best parameters
        print('\nBest Parameters: ', clf.best_params_)
    
        print('\npredicted labels: ', clf.best_estimator_.predict(G_Test))
        print("\nAccuracy on test set: {:.2f}%\n".format((clf.score(G_Test, Test_label))*100))
    
    if __name__ == '__main__':
        main()
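
As a sanity check on the Gram matrices built in `transform()` above, the sketch below recomputes them outside the class and compares them with scikit-learn's built-in `rbf_kernel`. The fixed value `gamma = 0.1` is an arbitrary assumption for illustration; the dummy arrays are the ones from `main()`.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.metrics.pairwise import rbf_kernel

gamma = 0.1  # assumed value, chosen only for this check

Train = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15]])
Test = np.array([[4, 5, 6], [0, 1, 0], [1, 2, 1], [0, 4, 1]])

# RBF kernel: K(x, y) = exp(-gamma * ||x - y||^2)
G_Train = np.exp(-gamma * np.square(cdist(Train, Train, 'euclidean')))
G_Test = np.exp(-gamma * np.square(cdist(Test, Train, 'euclidean')))

# G_Train is (n_train, n_train); G_Test is (n_test, n_train).
print(G_Train.shape, G_Test.shape)

# Both should agree with the built-in implementation up to rounding.
print(np.allclose(G_Train, rbf_kernel(Train, Train, gamma=gamma)))
print(np.allclose(G_Test, rbf_kernel(Test, Train, gamma=gamma)))
```

Note that the columns of `G_Test` are indexed by the training samples, which is the layout `SVC(kernel='precomputed')` expects at prediction time.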
    

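Parameter C can be tuned without any problem. However, I noticed that only the first value of the gamma parameter is ever reported as the best parameter found. In the example above I get the best parameters C = 1, gamma = 0.01. No matter which values of C and gamma I add to p, I always get only the first gamma value in the sequence. Below is the output of the code above:

Output: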

    python: 3.5.2 |Anaconda custom (64-bit)| (default, Jul  5 2016, 11:41:13) [MSC v.1900 64 bit (AMD64)]
    numpy: 1.13.1
    sklearn: 0.19.0
    
    GridSearchCV(cv=2, error_score='raise',
           estimator=Pipeline(memory=None,
         steps=[('svm', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
      decision_function_shape='ovr', degree=3, gamma='auto',
      kernel='precomputed', max_iter=-1, probability=False, random_state=None,
      shrinking=True, tol=0.001, verbose=False))]),
           fit_params=None, iid=True, n_jobs=-1,
           param_grid=array([{'svm__gamma': [0.01, 0.1], 'svm__C': [1, 10]}], dtype=object),
           pre_dispatch='2*n_jobs', refit='True', return_train_score=True,
           scoring=None, verbose=0)
    
    Best Parameters:  {'svm__gamma': 0.01, 'svm__C': 1}
    
    predicted labels:  [1 1 1 1]
    Accuracy on test set: 50.00%
    

I would appreciate any suggestions.
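
One possible explanation, offered as a sketch rather than a confirmed answer: with `kernel='precomputed'`, `SVC` does not use its own `gamma`, and in the code above `G_Train` is computed once (with the default `gamma=1.0`) before the grid search starts, so every `svm__gamma` candidate is scored on the same matrix. The sketch below assumes it is acceptable to move the kernel computation into a pipeline step so that the search can vary it. `RBFGram` and the step name `gram` are hypothetical names introduced here, not part of scikit-learn or the original code.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC


class RBFGram(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: maps raw features to an RBF Gram matrix."""

    def __init__(self, gamma=1.0):
        self.gamma = gamma

    def fit(self, X, y=None):
        # Remember the samples the SVM is trained on in this fold.
        self.X_fit_ = np.asarray(X, dtype=float)
        return self

    def transform(self, X):
        # Rows: samples in X; columns: samples seen during fit.
        d2 = cdist(np.asarray(X, dtype=float), self.X_fit_, 'sqeuclidean')
        return np.exp(-self.gamma * d2)


Train = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15]])
Test = np.array([[4, 5, 6], [0, 1, 0], [1, 2, 1], [0, 4, 1]])
Train_label = [1, 1, 1, 0, 0]

pipe = Pipeline(steps=[('gram', RBFGram()), ('svm', SVC(kernel='precomputed'))])

# gamma now belongs to the kernel step, not to the SVC step.
param_grid = {'gram__gamma': [0.01, 0.1], 'svm__C': [1, 10]}

clf = GridSearchCV(pipe, param_grid, cv=2)
clf.fit(Train, Train_label)  # raw features, not a precomputed Gram matrix

print('Best parameters:', clf.best_params_)
print('Predicted labels:', clf.predict(Test))
```

Here `fit` is called on the raw features, and each cross-validation fold builds its own Gram matrix against that fold's training samples. On a dataset this small the selected parameters are not meaningful; the point is only that `gram__gamma` now changes the matrix the SVM sees.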

0 answers:

No answers