I have the following code, which uses the silhouette method to choose the best number of clusters:
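
    import logging
    import os
    from typing import Tuple

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    logger = logging.getLogger(__name__)


    def kmeans_silhouette(data) -> Tuple[np.ndarray, np.ndarray]:
        """
        Performs silhouette method to choose the best result for kMeans clustering.
        :param data: data to be clustered.
        :return: labels and centroids of the best-scoring clustering.
        """
        logger.info(len(data))
        # A single sample cannot be clustered; return it as its own centroid.
        if len(data) == 1:
            return [0], data
        range_n_clusters = [2, 3, 4, 5, 6]
        labels = None
        centroids = None
        silhouette = -999
        for n_clusters in range_n_clusters:
            kmeans = KMeans(n_clusters=n_clusters, random_state=0)
            cluster_labels = kmeans.fit_predict(data)
            # This is the call that sometimes raises the ValueError.
            silhouette_avg = silhouette_score(data, cluster_labels)
            # Keep the clustering with the highest average silhouette score.
            if silhouette_avg > silhouette:
                silhouette = silhouette_avg
                labels = cluster_labels
                centroids = kmeans.cluster_centers_
            # .get() avoids a KeyError when DEBUG is not set.
            if os.environ.get("DEBUG"):
                logger.info(
                    f"For n_clusters = {n_clusters}, the average silhouette_score is {silhouette_avg}"
                )
        return labels, centroids
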
However, sometimes I get this error:

      File "/home/paula/.local/lib/python3.6/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 117, in silhouette_score
        return np.mean(silhouette_samples(X, labels, metric=metric, **kwds))
      File "/home/paula/.local/lib/python3.6/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 228, in silhouette_samples
        check_number_of_labels(len(le.classes_), n_samples)
      File "/home/paula/.local/lib/python3.6/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 35, in check_number_of_labels
        "to n_samples - 1 (inclusive)" % n_labels)
    ValueError: Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive)

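This happens when only one cluster is found, because the silhouette method needs at least two clusters (see the related question "ValueError: Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive) when using silhouette_score").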
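For reference, the error can be reproduced without KMeans at all. This is a minimal sketch with made-up data (`X` and `single_label` are illustrative names, not part of my code): passing a label array that contains a single distinct value to `silhouette_score` raises the same `ValueError`.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Illustrative data: four distinct points, but every point gets label 0.
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
single_label = np.zeros(len(X), dtype=int)

try:
    silhouette_score(X, single_label)
    raised = False
    message = ""
except ValueError as exc:
    # silhouette_score needs between 2 and n_samples - 1 distinct labels.
    raised = True
    message = str(exc)

print(raised, message)
```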
So I checked the number of unique cluster_labels and got only one:

    logger.info(np.unique(kmeans.labels_))
    INFO [0]

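But I specified a minimum of 2 clusters. Does it make sense for KMeans, which takes a parameter specifying the number of clusters, to return fewer clusters than requested?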
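One situation where this can happen is when the input contains fewer distinct points than `n_clusters`, for example when all rows are duplicates. Whether that applies to the data above is an assumption, since that data is not shown. The sketch below reproduces that case with made-up data and guards the call to `silhouette_score` by checking the number of distinct labels first. The helper name `safe_silhouette` and the explicit `n_init=10` are illustrative choices, not part of the code above.

```python
import warnings

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def safe_silhouette(data, n_clusters):
    """Return (score, labels); score is None when the silhouette is undefined.

    Hypothetical helper: it skips silhouette_score when KMeans produced
    fewer than 2 (or more than n_samples - 1) distinct labels.
    """
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    with warnings.catch_warnings():
        # KMeans emits a warning when it finds fewer distinct clusters
        # than requested; silence it here and check the labels instead.
        warnings.simplefilter("ignore")
        labels = kmeans.fit_predict(data)
    n_found = len(np.unique(labels))
    if n_found < 2 or n_found > len(data) - 1:
        return None, labels
    return silhouette_score(data, labels), labels


# Assumed example input: five identical rows, so only one distinct point.
duplicates = np.ones((5, 2))
dup_score, dup_labels = safe_silhouette(duplicates, 2)

# Two well-separated pairs of points, for comparison.
separated = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
sep_score, sep_labels = safe_silhouette(separated, 2)

print(dup_score, np.unique(dup_labels))
print(sep_score, np.unique(sep_labels))
```

In the loop above, a `None` score could simply be skipped with `continue`, so that only values of `n_clusters` that yield a valid silhouette are compared.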