Grid search for unsupervised learning with Scikit-Learn

Time: 2016-05-24 20:00:34

Tags: scikit-learn

I get an error from the following code:

import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import load_digits
from sklearn.neighbors import KernelDensity
from sklearn.decomposition import PCA
from sklearn.grid_search import GridSearchCV
from sklearn import linear_model, mixture, decomposition, datasets

# load the data
digits = load_digits()
data = digits.data

pca = PCA(n_components=15, whiten=False)
data = pca.fit_transform(digits.data)

gmm = mixture.GMM()

# use grid search cross-validation 
params = {'gmm__n_components':(2, 3)}

grid = GridSearchCV(gmm, params)
grid.fit(data)

ERROR:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-9-07b1b825ee22> in <module>()
     22 
     23 grid = GridSearchCV(gmm, params)
---> 24 grid.fit(data)
     25 

C:\Anaconda2\lib\site-packages\sklearn\grid_search.pyc in fit(self, X, y)
    802 
    803         """
--> 804         return self._fit(X, y, ParameterGrid(self.param_grid))
    805 
    806 

C:\Anaconda2\lib\site-packages\sklearn\grid_search.pyc in _fit(self, X, y, parameter_iterable)
    551                                     self.fit_params, return_parameters=True,
    552                                     error_score=self.error_score)
--> 553                 for parameters in parameter_iterable
    554                 for train, test in cv)
    555 

C:\Anaconda2\lib\site-packages\sklearn\externals\joblib\parallel.pyc in __call__(self, iterable)
    802             self._iterating = True
    803 
--> 804             while self.dispatch_one_batch(iterator):
    805                 pass
    806 

C:\Anaconda2\lib\site-packages\sklearn\externals\joblib\parallel.pyc in dispatch_one_batch(self, iterator)
    660                 return False
    661             else:
--> 662                 self._dispatch(tasks)
    663                 return True
    664 

C:\Anaconda2\lib\site-packages\sklearn\externals\joblib\parallel.pyc in _dispatch(self, batch)
    568 
    569         if self._pool is None:
--> 570             job = ImmediateComputeBatch(batch)
    571             self._jobs.append(job)
    572             self.n_dispatched_batches += 1

C:\Anaconda2\lib\site-packages\sklearn\externals\joblib\parallel.pyc in __init__(self, batch)
    181         # Don't delay the application, to avoid keeping the input
    182         # arguments in memory
--> 183         self.results = batch()
    184 
    185     def get(self):

C:\Anaconda2\lib\site-packages\sklearn\externals\joblib\parallel.pyc in __call__(self)
     70 
     71     def __call__(self):
---> 72         return [func(*args, **kwargs) for func, args, kwargs in self.items]
     73 
     74     def __len__(self):

C:\Anaconda2\lib\site-packages\sklearn\cross_validation.pyc in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, error_score)
   1518 
   1519     if parameters is not None:
-> 1520         estimator.set_params(**parameters)
   1521 
   1522     start_time = time.time()

C:\Anaconda2\lib\site-packages\sklearn\base.pyc in set_params(self, **params)
    259                                      'Check the list of available parameters '
    260                                      'with `estimator.get_params().keys()`.' %
--> 261                                      (name, self))
    262                 sub_object = valid_params[name]
    263                 sub_object.set_params(**{sub_name: value})

ValueError: Invalid parameter gmm for estimator GMM(covariance_type='diag', init_params='wmc', min_covar=0.001,
  n_components=1, n_init=1, n_iter=100, params='wmc', random_state=None,
  thresh=None, tol=0.001, verbose=0). Check the list of available parameters with `estimator.get_params().keys()`.

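I found similar code on the Scikit-Learn site that works fine (see the code below), yet the code above gives me an error. The only difference is the algorithm. Does that make a difference? How can I fix this? Thanks.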

import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import load_digits
from sklearn.neighbors import KernelDensity
from sklearn.decomposition import PCA
from sklearn.grid_search import GridSearchCV

# load the data
digits = load_digits()
data = digits.data

# project the 64-dimensional data to a lower dimension
pca = PCA(n_components=15, whiten=False)
data = pca.fit_transform(digits.data)

# use grid search cross-validation to optimize the bandwidth
params = {'bandwidth': np.logspace(-1, 1, 20)}
grid = GridSearchCV(KernelDensity(), params)
grid.fit(data)

print("best bandwidth: {0}".format(grid.best_estimator_.bandwidth))

1 answer:

Answer 0 (score: 0)

I see two problems with your code.

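First, because you are passing a single estimator to GridSearchCV, you should not put gmm__ at the start of the parameter names in the parameter grid. The double-underscore prefix addresses a named sub-estimator (for example a Pipeline step), and a bare GMM has no sub-estimator called gmm, which is why set_params reports "Invalid parameter gmm". Removing the prefix gets you past the error quoted above. You can change the parameter grid assignment as follows: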

params = {'n_components':(2, 3)}
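As the error message suggests, the valid names can be checked with `estimator.get_params().keys()`. The sketch below is only illustrative and makes one assumption that is not in the question: it uses `GaussianMixture`, the class that newer scikit-learn releases provide in place of `mixture.GMM`. The pipeline and its step names `pca` and `gmm` are hypothetical and exist only to show when the `gmm__` prefix would be valid.

```python
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture  # assumption: newer scikit-learn, replaces mixture.GMM
from sklearn.pipeline import Pipeline

# A bare estimator exposes its parameters without any prefix.
bare_names = sorted(GaussianMixture().get_params().keys())
print(bare_names)

# Inside a Pipeline, each step's parameters are addressed as <step name>__<parameter>.
# The step names 'pca' and 'gmm' are hypothetical labels chosen for this sketch.
pipe = Pipeline([('pca', PCA(n_components=15)), ('gmm', GaussianMixture())])
pipe_names = sorted(pipe.get_params().keys())
print([name for name in pipe_names if name.startswith('gmm__')])
```

So `'n_components'` is the right key for a bare estimator, while `'gmm__n_components'` would only be right if the estimator passed to GridSearchCV were a pipeline with a step named `gmm`.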

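But once you get past that error, you will run into a second problem. GMM.score() returns an array (one log probability per sample) rather than a single score value. In this respect it differs from what sklearn does for KMeans, KernelDensity, PCA, etc. (see the discussion in this issue: https://github.com/scikit-learn/scikit-learn/issues/2473). The array of scores from GMM makes GridSearchCV raise an error, because it expects a single value. The example you took from sklearn's website uses KernelDensity, so it does not have this problem.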
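One possible workaround, not part of the original answer, is to hand GridSearchCV a scoring callable that reduces the estimator's score to a single number. The sketch below assumes a newer scikit-learn in which `GridSearchCV` lives in `sklearn.model_selection` and `GaussianMixture` stands in for `mixture.GMM`; the helper name `mean_log_likelihood` is hypothetical. In those releases `GaussianMixture.score()` already returns a single average log-likelihood, so the `np.mean` call is only a safeguard for estimators that return per-sample values.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture       # assumption: replaces mixture.GMM
from sklearn.model_selection import GridSearchCV  # assumption: replaces sklearn.grid_search


def mean_log_likelihood(estimator, X, y=None):
    """Hypothetical scorer: collapse the estimator's score to one float."""
    return float(np.mean(estimator.score(X)))


# Same preprocessing as in the question.
digits = load_digits()
data = PCA(n_components=15, whiten=False).fit_transform(digits.data)

params = {'n_components': (2, 3)}
grid = GridSearchCV(
    GaussianMixture(covariance_type='diag', random_state=0),
    params,
    scoring=mean_log_likelihood,  # higher held-out log-likelihood is better
    cv=3,
)
grid.fit(data)
print(grid.best_params_)
```

Held-out log-likelihood is only one way to rank mixture models; it is used here because it mirrors what the default `score()` measures.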

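I suggest using a different algorithm whose score function matches what GridSearchCV expects, such as KMeans or KernelDensity. Alternatively, you can run gmm.fit() separately for each n_components value you want to test and compare the results in whatever way suits you best.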
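The manual alternative could look roughly like the sketch below. It is one way to do it under stated assumptions, not the answer author's method: it uses `GaussianMixture` from a newer scikit-learn instead of `mixture.GMM`, and it compares the fits by BIC, which is a choice made for this example (lower is better). The names `bic_by_k` and `best_k` are illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture  # assumption: replaces mixture.GMM

# Same preprocessing as in the question.
digits = load_digits()
data = PCA(n_components=15, whiten=False).fit_transform(digits.data)

# Fit one model per candidate n_components and record its BIC.
bic_by_k = {}
for k in (2, 3):
    model = GaussianMixture(n_components=k, covariance_type='diag', random_state=0)
    model.fit(data)
    bic_by_k[k] = model.bic(data)

# BIC penalizes model complexity; the smallest value is preferred.
best_k = min(bic_by_k, key=bic_by_k.get)
print(bic_by_k)
print("best n_components by BIC: {0}".format(best_k))
```

Because this fits on the full data set without cross-validation, it answers a slightly different question than GridSearchCV does; AIC or a held-out log-likelihood could be swapped in for BIC.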