Old (sklearn 0.17) GMM, DPGMM, VBGMM vs. new (sklearn 0.18) GaussianMixture and BayesianGaussianMixture

Asked: 2016-10-08 11:41:12

Tags: python scikit-learn cluster-analysis gaussian unsupervised-learning

In a previous scikit-learn version (0.17), I used the following code to automatically determine the best Gaussian mixture model and to tune the hyperparameters (alpha, covariance type, BIC) for unsupervised clustering.

# Gaussian Mixture Model 
try:       
    # Determine the most suitable covariance_type
    lowest_bic = np.infty
    bic = []
    cv_types = ['spherical', 'tied', 'diag', 'full']
    for cv_type in cv_types:
        # Fit a mixture of Gaussians with EM
        gmm = mixture.GMM(n_components=NUMBER_OF_CLUSTERS, covariance_type=cv_type)
        gmm.fit(transformed_features)
        bic.append(gmm.bic(transformed_features))
        if bic[-1] < lowest_bic:
            lowest_bic = bic[-1]
            best_gmm = gmm
            best_covariance_type = cv_type
    gmm = best_gmm
except Exception as e:
    print('Error with GMM estimator. Error: %s' % e)

# Dirichlet Process Gaussian Mixture Model  
try:
    # Determine the most suitable alpha parameter
    alpha = 2/math.log(len(transformed_features))     
    # Determine the most suitable covariance_type
    lowest_bic = np.infty
    bic = []
    cv_types = ['spherical', 'tied', 'diag', 'full']
    for cv_type in cv_types:
        # Fit a Dirichlet process mixture of Gaussians with variational inference
        dpgmm = mixture.DPGMM(n_components=NUMBER_OF_CLUSTERS, covariance_type=cv_type, alpha = alpha)
        dpgmm.fit(transformed_features)
        bic.append(dpgmm.bic(transformed_features))
        if bic[-1] < lowest_bic:
            lowest_bic = bic[-1]
            best_dpgmm = dpgmm
            best_covariance_type = cv_type        
    dpgmm = best_dpgmm                
except Exception as e:
    print('Error with DPGMM estimator. Error: %s' % e)

# Variational Inference for Gaussian Mixture Model   
try: 
    # Determine the most suitable alpha parameter 
    alpha = 2/math.log(len(transformed_features))  
    # Determine the most suitable covariance_type
    lowest_bic = np.infty
    bic = []
    cv_types = ['spherical', 'tied', 'diag', 'full']
    for cv_type in cv_types:
        # Fit a variational Bayesian mixture of Gaussians
        vbgmm = mixture.VBGMM(n_components=NUMBER_OF_CLUSTERS, covariance_type=cv_type, alpha = alpha)
        vbgmm.fit(transformed_features)
        bic.append(vbgmm.bic(transformed_features))
        if bic[-1] < lowest_bic:
            lowest_bic = bic[-1]
            best_vbgmm = vbgmm
            best_covariance_type = cv_type
    vbgmm = best_vbgmm     
except Exception as e:
    print('Error with VBGMM estimator. Error: %s' % e)

How can I achieve the same or similar behaviour with the new GaussianMixture / BayesianGaussianMixture models introduced in scikit-learn 0.18?

According to the scikit-learn documentation, there is no "alpha" parameter any more; there is a "weight_concentration_prior" parameter instead. Are these the same? http://scikit-learn.org/stable/modules/generated/sklearn.mixture.BayesianGaussianMixture.html#sklearn.mixture.BayesianGaussianMixture


weight_concentration_prior : float | None, optional. The dirichlet concentration of each component on the weight distribution (Dirichlet). The higher concentration puts more mass in the center and will lead to more components being active, while a lower concentration parameter will lead to more mass at the edge of the mixture weights simplex. The value of the parameter must be greater than 0. If it is None, it's set to 1. / n_components.

http://scikit-learn.org/0.17/modules/generated/sklearn.mixture.VBGMM.html


alpha : float, default 1. Real number representing the concentration parameter of the dirichlet distribution. Intuitively, the higher the value of alpha, the more likely the variational mixture of Gaussians model will use all the components it can.

If these two parameters (alpha and weight_concentration_prior) are indeed the same, does that mean the formula alpha = 2 / math.log(len(transformed_features)) still applies as weight_concentration_prior = 2 / math.log(len(transformed_features))?

1 answer:

Answer 0 (score: 0)

The BIC score can still be used with the classic/EM implementation of GMMs provided by the GaussianMixture class.
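A minimal sketch of how the question's BIC selection loop could be ported to the 0.18 API: `mixture.GMM` becomes `mixture.GaussianMixture`, and `bic()` works as before. The toy data and the cluster count of 2 are assumptions for the demo, standing in for `transformed_features` and `NUMBER_OF_CLUSTERS`.

```python
import numpy as np
from sklearn import mixture

# Toy two-cluster data standing in for `transformed_features` (demo assumption).
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + 4])

lowest_bic = np.infty
best_gmm, best_covariance_type = None, None
for cv_type in ['spherical', 'tied', 'diag', 'full']:
    # Fit a mixture of Gaussians with EM using the 0.18 class
    gmm = mixture.GaussianMixture(n_components=2, covariance_type=cv_type,
                                  random_state=0)
    gmm.fit(X)
    bic = gmm.bic(X)
    if bic < lowest_bic:
        lowest_bic, best_gmm, best_covariance_type = bic, gmm, cv_type

print(best_covariance_type, lowest_bic)
```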

For a given value of alpha, the BayesianGaussianMixture class can adapt the number of effective components automatically (n_components just needs to be large enough).
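A sketch of that idea, assuming (as the question does, not as something the docs guarantee) that the old `alpha` heuristic carries over to `weight_concentration_prior`: set `n_components` to a generous upper bound and let unused components collapse to near-zero weights. The toy data is a demo assumption.

```python
import math
import numpy as np
from sklearn import mixture

# Toy two-cluster data standing in for `transformed_features` (demo assumption).
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + 4])

# weight_concentration_prior plays the role the old `alpha` did; the
# 2 / log(n_samples) heuristic from the question is kept as-is (unverified).
prior = 2 / math.log(len(X))

bgmm = mixture.BayesianGaussianMixture(
    n_components=10,                    # upper bound on the number of clusters
    weight_concentration_prior=prior,
    max_iter=500, random_state=0)
bgmm.fit(X)

# Components that actually carry weight; the rest were pruned by the prior.
active = int((bgmm.weights_ > 1e-2).sum())
print(active)
```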

You can also use standard cross-validation on the log-likelihood (via the score method of the model).
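One way that cross-validation could look in practice: `GaussianMixture.score` returns the mean per-sample log-likelihood, and since it ignores labels, `cross_val_score` can compare covariance types on held-out data. Again, the toy data and cluster count are demo assumptions.

```python
import numpy as np
from sklearn import mixture
from sklearn.model_selection import cross_val_score

# Toy two-cluster data standing in for `transformed_features` (demo assumption).
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(100, 2), rng.randn(100, 2) + 4])

# With no explicit scorer, cross_val_score falls back to the estimator's
# score(), i.e. the out-of-sample mean log-likelihood.
for cv_type in ['spherical', 'tied', 'diag', 'full']:
    gmm = mixture.GaussianMixture(n_components=2, covariance_type=cv_type,
                                  random_state=0)
    scores = cross_val_score(gmm, X, cv=3)
    print(cv_type, scores.mean())
```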