Question

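I want to use GaussianMixture from scikit-learn, and I need to perform model selection. I would like to do this with GridSearchCV, using BIC and AIC as the selection criteria. Both values are implemented as methods of GaussianMixture(), but I don't know how to plug them into the definition of my custom scorer, because the function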

make_scorer(score_func, greater_is_better=True, needs_proba=False, needs_threshold=False, **kwargs)

which I am using to create my custom scorer, takes as input a function score_func that must be defined as

score_func(y, y_pred, **kwargs)

Can anyone help me?

Answer 1

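Using BIC/AIC is an alternative to using cross-validation. GridSearchCV selects models using cross-validation, so to do model selection with BIC/AIC we have to do something different. Let's take an example where we generate samples from two Gaussians and then try to fit them using scikit-learn.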

import numpy as np
X1 = np.random.multivariate_normal([0.,0.],[[1.,0.],[0.,1.]],10000)
X2 = np.random.multivariate_normal([10.,10.],[[1.,0.],[0.,1.]],10000)
X = np.vstack((X1,X2))
np.random.shuffle(X)

Method 1: Cross-validation

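Cross-validation involves splitting the data into pieces. The model is then fitted on some of the pieces ('training') and tested on how well it performs on the remaining pieces ('validation'). This guards against overfitting. Here we will use two-fold cross-validation, where we split the data in half.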

from sklearn.mixture import GaussianMixture
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt

#check 1->4 components
tuned_parameters = {'n_components': np.array([1,2,3,4])}
#construct grid search object that uses 2 fold cross validation
clf = GridSearchCV(GaussianMixture(),tuned_parameters,cv=2)
#fit the data
clf.fit(X)
#plot the number of Gaussians against their rank
plt.scatter(clf.cv_results_['param_n_components'],\
            clf.cv_results_['rank_test_score'])

We can see that two-fold cross-validation favours two Gaussian components (rank 1 is the best score), as we would expect.
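If you would rather look at the underlying scores than at the ranks, something like the following sketch may help. It assumes the default scorer, which for GaussianMixture is the average per-sample log-likelihood on the held-out half. The names X_demo, search and cv_scores are made up for this illustration, and the sample size is reduced so it runs quickly.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import GridSearchCV

# Smaller, seeded version of the two-Gaussian data set (illustrative only)
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.multivariate_normal([0., 0.], np.eye(2), 1000),
    rng.multivariate_normal([10., 10.], np.eye(2), 1000),
])
rng.shuffle(X_demo)  # shuffle rows so both folds contain both clusters

# Same grid search as above: 1 to 4 components, two-fold cross-validation
search = GridSearchCV(GaussianMixture(random_state=0),
                      {'n_components': [1, 2, 3, 4]}, cv=2)
search.fit(X_demo)

# Map each number of components to its mean held-out log-likelihood
n_tried = [p['n_components'] for p in search.cv_results_['params']]
cv_scores = dict(zip(n_tried, search.cv_results_['mean_test_score']))
for k, score in cv_scores.items():
    print(f"n_components={k}: mean held-out log-likelihood = {score:.3f}")
```

A single component should score clearly worse than two. The differences between two, three and four components are typically small, which is why the ranking alone can look more decisive than it really is.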

Method 2: BIC / AIC

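Instead of using cross-validation, we can evaluate the BIC of the best-fit model for each number of Gaussians. We then choose the model with the lowest BIC. The procedure would be identical if you used the AIC (it is a different statistic and can give a different answer, but the structure of your code would be the same as below).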

bic = np.zeros(4)
n = np.arange(1,5)
models = []
#loop through each number of Gaussians and compute the BIC, and save the model
for i,j in enumerate(n):
    #create mixture model with j components
    gmm = GaussianMixture(n_components=j)
    #fit it to the data
    gmm.fit(X)
    #compute the BIC for this model
    bic[i] = gmm.bic(X)
    #add the best-fit model with j components to the list of models
    models.append(gmm)

After carrying out this procedure, we can plot the BIC against the number of Gaussians.

plt.plot(n,bic)

So we can see that the BIC is minimised for two Gaussians, so the best model according to this method also has two components.
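For completeness, here is a rough sketch of the same loop tracking the AIC alongside the BIC and picking the minimum of each programmatically rather than reading it off a plot. It uses the real GaussianMixture.aic and GaussianMixture.bic methods. The data set, the seeds and the variable names (X_demo, best_n_bic, best_n_aic) are assumptions made for this example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Smaller, seeded version of the two-Gaussian data set (illustrative only)
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.multivariate_normal([0., 0.], np.eye(2), 1000),
    rng.multivariate_normal([10., 10.], np.eye(2), 1000),
])

n_values = np.arange(1, 5)
bic_values, aic_values = [], []
for k in n_values:
    # Fit a mixture with k components on the full data set
    gmm = GaussianMixture(n_components=int(k), random_state=0).fit(X_demo)
    # Lower is better for both criteria
    bic_values.append(gmm.bic(X_demo))
    aic_values.append(gmm.aic(X_demo))

# Number of components that minimises each criterion
best_n_bic = int(n_values[np.argmin(bic_values)])
best_n_aic = int(n_values[np.argmin(aic_values)])
print("best n by BIC:", best_n_bic)
print("best n by AIC:", best_n_aic)
```

The AIC penalises extra parameters less strongly than the BIC, so it can settle on a larger number of components than the BIC does on the same data.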

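Because I drew 10000 samples from two very well-separated Gaussians (i.e. the distance between their centres is much larger than either of their dispersions), the answer was very clear-cut. This is not always the case, and often neither of these methods will confidently tell you which number of Gaussians to use, but rather some sensible range.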
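If you still want to drive the search through GridSearchCV, one option is to skip make_scorer altogether. The scoring argument also accepts a callable with the signature scorer(estimator, X, y), which gives you direct access to the fitted estimator and therefore to its bic method. The sketch below assumes that approach. neg_bic_scorer is a name I made up, and the sign is flipped because GridSearchCV maximises the score. Note that the BIC is then evaluated on the held-out half of each split, not on the full data set as in Method 2, so this is a hybrid of the two methods.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import GridSearchCV

def neg_bic_scorer(estimator, X, y=None):
    """Hypothetical scorer: negative BIC, so that higher is better."""
    return -estimator.bic(X)

# Smaller, seeded version of the two-Gaussian data set (illustrative only)
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.multivariate_normal([0., 0.], np.eye(2), 1000),
    rng.multivariate_normal([10., 10.], np.eye(2), 1000),
])
rng.shuffle(X_demo)  # shuffle rows so both folds contain both clusters

# Pass the callable directly as the scoring argument
search = GridSearchCV(GaussianMixture(random_state=0),
                      {'n_components': [1, 2, 3, 4]},
                      scoring=neg_bic_scorer, cv=2)
search.fit(X_demo)

best_n = search.best_params_['n_components']
print("n_components selected by held-out BIC:", best_n)
```

Replacing estimator.bic with estimator.aic in the scorer gives the AIC version.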
