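I want to use the silhouette score and a precomputed distance matrix to find the optimal number of clusters for a clustering algorithm. In the example below I am using AgglomerativeClustering (but I may use other clustering algorithms in the future).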
import numpy as np
from scipy import stats
from sklearn import cluster, metrics, model_selection

# `distances` is assumed to be a square (n, n) precomputed distance matrix

# define some clustering model
# (scikit-learn >= 1.2 renames the `affinity` parameter to `metric`)
agglomerative_clustering = cluster.AgglomerativeClustering(affinity="precomputed")

def _silhouette_scoring(clustering_model, distances):
    return metrics.silhouette_score(distances, clustering_model.labels_, metric="precomputed")

# define distributions over parameters to optimize
n, _ = distances.shape
# silhouette_score needs between 2 and n - 1 clusters; randint's `high` is exclusive
param_distributions = {'n_clusters': stats.randint(low=2, high=n),
                       'linkage': ["complete", "average"]}
prng = np.random.RandomState(42)
parameter_sampler = model_selection.ParameterSampler(param_distributions, n_iter=100, random_state=prng)

# keep the sampled parameters with the highest silhouette score
optimal_params = None
optimal_params_score = -np.inf
for sampled_params in parameter_sampler:
    agglomerative_clustering = cluster.AgglomerativeClustering(affinity="precomputed", **sampled_params)
    agglomerative_clustering.fit(distances)
    sampled_params_score = _silhouette_scoring(agglomerative_clustering, distances)
    if sampled_params_score > optimal_params_score:
        optimal_params, optimal_params_score = sampled_params, sampled_params_score
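Running the code above works, but I feel that choosing the optimal number of clusters is a very common task that should be doable with sklearn.model_selection or {{1} } or similar. How should I go about this?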
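For concreteness, here is a minimal sketch of the kind of thing I have in mind, assuming `GridSearchCV` from `sklearn.model_selection` is a suitable tool. The toy blob data, the `silhouette_scorer` helper, and the single "split" that uses the full data set for both fitting and scoring are my own assumptions for illustration. The split is a workaround, since clustering has no held-out labels; it is not real cross-validation, and I am not sure this is how the API is meant to be used.

```python
import inspect

import numpy as np
from sklearn import cluster, metrics, model_selection

# Toy data (assumption, for illustration only): three well-separated 2-D blobs.
rng = np.random.RandomState(0)
points = np.vstack([rng.normal(loc=center, scale=0.1, size=(10, 2))
                    for center in ([0.0, 0.0], [5.0, 5.0], [0.0, 5.0])])
# Pairwise Euclidean distance matrix, shape (30, 30), with a zero diagonal.
distances = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
n = distances.shape[0]

# Newer scikit-learn releases call this parameter `metric`, older ones `affinity`;
# pick whichever the installed version accepts.
init_params = inspect.signature(cluster.AgglomerativeClustering.__init__).parameters
metric_kwarg = "metric" if "metric" in init_params else "affinity"
model = cluster.AgglomerativeClustering(linkage="average", **{metric_kwarg: "precomputed"})


def silhouette_scorer(estimator, X, y=None):
    """Hypothetical scorer with the (estimator, X, y) signature GridSearchCV calls."""
    return metrics.silhouette_score(X, estimator.labels_, metric="precomputed")


# One "split" whose train and test indices are both the full data set, so the
# labels produced by fit() line up with the matrix passed to the scorer.
all_indices = np.arange(n)
search = model_selection.GridSearchCV(
    model,
    param_grid={"n_clusters": list(range(2, 7)),
                "linkage": ["complete", "average"]},
    scoring=silhouette_scorer,
    cv=[(all_indices, all_indices)],
)
search.fit(distances)

print(search.best_params_, search.best_score_)
```

Swapping `GridSearchCV` for `RandomizedSearchCV` with `param_distributions` would presumably mirror my hand-written loop more closely, but I have not checked whether that is the recommended approach for clustering.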