sklearn KMeans ignores the number of clusters

Asked: 2020-02-25 21:56:33

Tags: python scikit-learn k-means

I have the following code, which uses the silhouette method to choose the best number of clusters:

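import logging
import os
from typing import Tuple

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

logger = logging.getLogger(__name__)


def kmeans_silhouette(data) -> Tuple[np.ndarray, np.ndarray]:
    """
    Performs silhouette method to choose the best result for kMeans clustering.

    :param data: data to be clustered.
    :return: cluster labels and cluster centroids of the best clustering.
    """
    logger.info(len(data))
    if len(data) == 1:
        return np.array([0]), data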

    range_n_clusters = [2, 3, 4, 5, 6]
    labels = None
    centroids = None
    silhouette = -999

    for n_clusters in range_n_clusters:
        kmeans = KMeans(n_clusters=n_clusters, random_state=0)
        cluster_labels = kmeans.fit_predict(data)

        silhouette_avg = silhouette_score(data, cluster_labels)
        if silhouette_avg > silhouette:
            silhouette = silhouette_avg
            labels = cluster_labels
            centroids = kmeans.cluster_centers_

        if os.environ.get("DEBUG"):  # .get avoids a KeyError when DEBUG is unset
            logger.info(
                f"For n_clusters = {n_clusters}, the average silhouette_score is {silhouette_avg}"
            )

    return labels, centroids

However, sometimes this error occurs:

  File "/home/paula/.local/lib/python3.6/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 117, in silhouette_score
    return np.mean(silhouette_samples(X, labels, metric=metric, **kwds))
  File "/home/paula/.local/lib/python3.6/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 228, in silhouette_samples
    check_number_of_labels(len(le.classes_), n_samples)
  File "/home/paula/.local/lib/python3.6/site-packages/sklearn/metrics/cluster/_unsupervised.py", line 35, in check_number_of_labels
    "to n_samples - 1 (inclusive)" % n_labels)
ValueError: Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive)

This happens when only one cluster is found, because the silhouette method needs at least two clusters (see "ValueError: Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive) when using silhouette_score").
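To make the constraint in the error message concrete, here is a minimal sketch of a guard that could run before calling silhouette_score. It only mirrors the condition stated in the message (at least 2 and at most n_samples - 1 distinct labels). The helper name silhouette_is_defined is hypothetical and not part of scikit-learn.

```python
import numpy as np


def silhouette_is_defined(labels) -> bool:
    """Return True if the labels satisfy 2 <= n_labels <= n_samples - 1.

    Hypothetical helper for illustration; not a scikit-learn function.
    """
    labels = np.asarray(labels)
    n_labels = len(np.unique(labels))  # number of distinct cluster labels
    return 1 < n_labels < len(labels)
```

Inside the loop, a check such as `if not silhouette_is_defined(cluster_labels): continue` would skip the candidates for which the score cannot be computed, instead of raising.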

So I checked the number of unique cluster_labels and got only one:

logger.info(np.unique(kmeans.labels_))

INFO [0]

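But the minimum number of clusters I specified is 2. I would like to know whether KMeans has a parameter that enforces the number of clusters, and whether it makes sense for it to return fewer clusters than requested.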
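One situation in which KMeans can return fewer distinct labels than n_clusters is when the data contains fewer distinct points than the requested number of clusters, for example when all rows are duplicates. scikit-learn then emits a ConvergenceWarning rather than an error. Whether this is what happens with the data in the question is an assumption; the toy array below is made up purely to reproduce the symptom.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up data: five identical rows, so there is only one distinct point.
X = np.ones((5, 2))

# Request two clusters, as in the first iteration of the loop in the question.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Number of rows that are actually different from each other.
n_distinct_points = len(np.unique(X, axis=0))

print(np.unique(labels))
print(n_distinct_points)
```

If this is the cause, comparing `len(np.unique(data, axis=0))` against each candidate n_clusters before fitting would show which values cannot be satisfied.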

0 answers:

No answers yet.