How do I select the optimal number of clusters using scikit-learn?

Asked: 2018-09-03 07:18:43

Tags: python-3.x scikit-learn cluster-analysis cross-validation

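I want to find the optimal number of clusters for a clustering algorithm using the silhouette score and a precomputed distance matrix. In the example below I am using AgglomerativeClustering (but I may use other clustering algorithms in the future).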

import numpy as np
from scipy import stats
from sklearn import cluster, metrics, model_selection


# define some clustering model
# (scikit-learn >= 1.2 renames the `affinity` parameter to `metric`)
agglomerative_clustering = cluster.AgglomerativeClustering(affinity="precomputed")

def _silhouette_scoring(clustering_model, distances):
    return metrics.silhouette_score(distances, clustering_model.labels_, metric="precomputed")

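# define distributions over parameters to optimize
# `distances` is the precomputed (n, n) distance matrix, defined elsewhere
n, _ = distances.shape
# silhouette_score needs between 2 and n - 1 clusters; randint excludes `high`
param_distributions = {'n_clusters': stats.randint(low=2, high=n),
                       'linkage': ["complete", "average"]}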

prng = np.random.RandomState(42)
parameter_sampler = model_selection.ParameterSampler(param_distributions, n_iter=100, random_state=prng)

optimal_params = None
optimal_params_score = -np.inf

for sampled_params in parameter_sampler:
    agglomerative_clustering = cluster.AgglomerativeClustering(affinity="precomputed", **sampled_params)
    agglomerative_clustering.fit(distances)
    sampled_params_score = _silhouette_scoring(agglomerative_clustering, distances)

    if sampled_params_score > optimal_params_score:
        optimal_params, optimal_params_score = sampled_params, sampled_params_score

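Running the code above works, but selecting the optimal number of clusters seems like a very common task, and I feel it should be possible with sklearn.model_selection or {{1} } or similar. How should this be done?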
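For illustration, here is one way the search might be pushed through sklearn.model_selection. This is only a sketch under assumptions, not a confirmed recommended pattern: it hands GridSearchCV a single "split" whose train and test indices both cover the whole distance matrix, so no real cross-validation takes place, and it scores each fitted model with a custom callable. The names `silhouette_scorer`, `search_n_clusters`, `DISTANCE_PARAM` and the toy data are hypothetical, and the signature check only exists because the `affinity` parameter is called `metric` in newer scikit-learn releases.

```python
import inspect

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score
from sklearn.model_selection import GridSearchCV

# Older scikit-learn releases call this parameter `affinity`, newer ones `metric`.
_params = inspect.signature(AgglomerativeClustering.__init__).parameters
DISTANCE_PARAM = "metric" if "metric" in _params else "affinity"


def silhouette_scorer(estimator, X, y=None):
    """Score a fitted clustering model on the distances it was fitted on."""
    return silhouette_score(X, estimator.labels_, metric="precomputed")


def search_n_clusters(distances, max_clusters=10):
    """Grid-search n_clusters and linkage on a precomputed distance matrix."""
    n = distances.shape[0]
    all_rows = np.arange(n)
    param_grid = {
        # silhouette_score is only defined for 2 .. n - 1 clusters
        "n_clusters": list(range(2, min(max_clusters, n - 1) + 1)),
        "linkage": ["complete", "average"],
    }
    model = AgglomerativeClustering(
        **{DISTANCE_PARAM: "precomputed", "linkage": "average"}
    )
    search = GridSearchCV(
        model,
        param_grid,
        scoring=silhouette_scorer,
        # one "fold" that fits and scores on the full matrix (assumption:
        # no held-out data is wanted for this kind of clustering search)
        cv=[(all_rows, all_rows)],
    )
    return search.fit(distances)


# Toy data (hypothetical): three well-separated 2-D blobs.
rng = np.random.RandomState(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + 0.5 * rng.randn(20, 2) for c in centers])
distances = np.linalg.norm(points[:, None] - points[None, :], axis=-1)

search = search_n_clusters(distances, max_clusters=6)
print(search.best_params_, search.best_score_)
```

The grid replaces the random sampling from the loop above; RandomizedSearchCV accepts the same `scoring` and `cv` arguments if sampling from distributions is preferred.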

0 answers:

No answers yet