sklearn KMedoids returns empty clusters

Date: 2020-06-05 12:24:19

Tags: python scikit-learn cluster-analysis

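I am using KMedoids from sklearn_extra.cluster. I was using it with a precomputed distance matrix (metric='precomputed') and it worked. However, we found a bug in the way the distance matrix was computed, so we had to implement the computation ourselves. Since then, the KMedoids algorithm no longer works. Here is the output (repeated warnings rather than an exception):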

C:\Users\...\AppData\Local\Programs\Python\Python38-32\lib\site-packages\sklearn_extra\cluster\_k_medoids.py:231: UserWarning: Cluster 1 is empty! self.labels_[self.medoid_indices_[1]] may not be labeled with its corresponding cluster (1).
  warnings.warn(
C:\Users\...\AppData\Local\Programs\Python\Python38-32\lib\site-packages\sklearn_extra\cluster\_k_medoids.py:231: UserWarning: Cluster 2 is empty! self.labels_[self.medoid_indices_[2]] may not be labeled with its corresponding cluster (2).
  warnings.warn(
C:\Users\...\AppData\Local\Programs\Python\Python38-32\lib\site-packages\sklearn_extra\cluster\_k_medoids.py:231: UserWarning: Cluster 3 is empty! self.labels_[self.medoid_indices_[3]] may not be labeled with its corresponding cluster (3).
  warnings.warn(
C:\Users\...\AppData\Local\Programs\Python\Python38-32\lib\site-packages\sklearn_extra\cluster\_k_medoids.py:231: UserWarning: Cluster 4 is empty! self.labels_[self.medoid_indices_[4]] may not be labeled with its corresponding cluster (4).
  warnings.warn(
C:\Users\...\AppData\Local\Programs\Python\Python38-32\lib\site-packages\sklearn_extra\cluster\_k_medoids.py:231: UserWarning: Cluster 5 is empty! self.labels_[self.medoid_indices_[5]] may not be labeled with its corresponding cluster (5).
  warnings.warn(
C:\Users\...\AppData\Local\Programs\Python\Python38-32\lib\site-packages\sklearn_extra\cluster\_k_medoids.py:231: UserWarning: Cluster 6 is empty! self.labels_[self.medoid_indices_[6]] may not be labeled with its corresponding cluster (6).
  warnings.warn(
C:\Users\...\AppData\Local\Programs\Python\Python38-32\lib\site-packages\sklearn_extra\cluster\_k_medoids.py:231: UserWarning: Cluster 7 is empty! self.labels_[self.medoid_indices_[7]] may not be labeled with its corresponding cluster (7).
  warnings.warn(

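I have checked the distance matrix: it is a 2D NumPy array of shape n_data x n_data with zeros on the diagonal, so that should not be the problem. All values are between 0 and 1. We used to use this algorithm for the Gower distance, but for some reason it does not work when the data is purely categorical. All of our values are booleans. The Gower distance function raises the following error: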

File "C:\Users\...\AppData\Local\Programs\Python\Python38-32\lib\site-packages\gower\gower_dist.py", line 62, in gower_matrix
    Z_num = np.divide(Z_num ,num_max,out=np.zeros_like(Z_num), where=num_max!=0)
TypeError: ufunc 'true_divide' output (typecode 'd') could not be coerced to provided output parameter (typecode '?') according to the casting rule ''same_kind''
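
The traceback above shows that the output buffer passed to np.divide has boolean dtype (typecode '?'), so the float result cannot be written into it. For data that is entirely boolean, one way around the library call is to compute the distance directly: with only categorical features, the Gower distance is commonly defined as the fraction of features on which two rows disagree (simple matching). The sketch below assumes that definition and equal feature weights; the function name matching_distance_matrix is hypothetical and not part of any package.

```python
import numpy as np

def matching_distance_matrix(X):
    """Pairwise fraction of features on which two boolean rows disagree."""
    X = np.asarray(X, dtype=bool)
    # Broadcast to an (n, n, n_features) array of per-feature mismatches.
    mismatches = X[:, None, :] != X[None, :, :]
    # Averaging over features yields a float64 (n, n) matrix in [0, 1].
    return mismatches.mean(axis=2)

# Small illustrative data set (made up for this example).
X = np.array([
    [True, False, True, True],
    [True, True, True, False],
    [False, False, False, False],
])
D = matching_distance_matrix(X)
print(D)
```

The broadcast uses memory proportional to n * n * n_features, so for large data sets a pairwise routine such as scipy.spatial.distance.pdist with metric='hamming' may be a better fit.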

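I also tried the KMedoids implementation from pyclustering, and that does work. However, with pyclustering you have to define the initial medoids yourself, and the method I found for doing so does not work with categorical data (see below).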

initial_medoids = kmeans_plusplus_initializer(data, n_clus, kmeans_plusplus_initializer.FARTHEST_CENTER_CANDIDATE).initialize(return_index=True)

Stacktrace:

File "path_to_file", line 19, in <module>
    initial_medoids = kmeans_plusplus_initializer(data, n_clus, kmeans_plusplus_initializer.FARTHEST_CENTER_CANDIDATE).initialize(return_index=True)
  File "path\Python\Python38-32\lib\site-packages\pyclustering\cluster\center_initializer.py", line 357, in initialize
    index_point = self.__get_next_center(centers)
  File "path\Python\Python38-32\lib\site-packages\pyclustering\cluster\center_initializer.py", line 256, in __get_next_center
    distances = self.__calculate_shortest_distances(self.__data, centers)
  File "path\Python\Python38-32\lib\site-packages\pyclustering\cluster\center_initializer.py", line 236, in __calculate_shortest_distances      
    dataset_differences[index_center] = numpy.sum(numpy.square(data - center), axis=1).T
TypeError: numpy boolean subtract, the `-` operator, is not supported, use the bitwise_xor, the `^` operator, or the logical_xor function instead.
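
The error above comes from NumPy itself: the `-` operator is not defined for boolean arrays. A minimal sketch of the failure and of one workaround, casting the data to float before handing it to code that subtracts rows, is shown below. The array contents are made up for illustration.

```python
import numpy as np

data = np.array([[True, False], [False, False]])
center = data[0]

# Subtracting boolean arrays raises TypeError in NumPy.
try:
    data - center
    subtraction_failed = False
except TypeError:
    subtraction_failed = True

# Casting to float makes the same arithmetic valid.
data_float = data.astype(float)
sq_dist = np.sum(np.square(data_float - data_float[0]), axis=1)
print(subtraction_failed, sq_dist)
```

For 0/1 data the squared Euclidean distance equals the number of mismatching features, so the cast does not change which points count as close.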

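My problem could be solved in any of three ways, so I hope someone can help with one of them:

  1. Someone knows why sklearn's KMedoids does not work and can help me fix it, so I can use it.
  2. Someone knows what I am doing wrong with the Gower function from PyPI, so I can use it with either pyclustering or sklearn.
  3. Someone knows an easy way to find initial medoids for pyclustering, so I can use pyclustering.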
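
For the third option, initial medoids can be chosen from the distance matrix alone, which sidesteps the boolean arithmetic problem. The sketch below uses a greedy farthest-first heuristic: start from the most central point, then repeatedly add the point farthest from all medoids chosen so far. This is a swapped-in heuristic, not the k-means++ initializer from pyclustering, and the function name farthest_first_medoids is hypothetical.

```python
import numpy as np

def farthest_first_medoids(D, n_clusters, first=None):
    """Pick n_clusters row indices of D that are spread far apart."""
    D = np.asarray(D, dtype=float)
    if first is None:
        # Start from the point with the smallest total distance to all others.
        first = int(np.argmin(D.sum(axis=1)))
    chosen = [first]
    # min_dist[i] = distance from point i to its nearest chosen medoid.
    min_dist = D[first].copy()
    while len(chosen) < n_clusters:
        nxt = int(np.argmax(min_dist))
        if min_dist[nxt] == 0:
            raise ValueError("fewer distinct points than requested clusters")
        chosen.append(nxt)
        min_dist = np.minimum(min_dist, D[nxt])
    return chosen

# Four points on a line at positions 0, 1, 5, 6 (illustrative data).
pos = np.array([0.0, 1.0, 5.0, 6.0])
D = np.abs(pos[:, None] - pos[None, :])
initial_medoids = farthest_first_medoids(D, n_clusters=2)
print(initial_medoids)
```

The resulting index list has the form that pyclustering's kmedoids takes as initial medoid indices. Whether your installed version accepts a distance matrix directly (the documentation describes a data_type='distance_matrix' option) is an assumption to verify.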

I have posted a simplified version of my code below.

import pandas as pd
import gower_distance as dist
from sklearn_extra.cluster import KMedoids

data = pd.read_csv(path_to_data)
dist = calcDist(data) # Returns NxN array where N is the amount of data points
# I'm using 8 clusters, which is the default, so I haven't defined it
kmedoids = KMedoids(metric='precomputed').fit(dist)
labels = kmedoids.predict(dist)
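
The checks described in the question (square shape, zero diagonal, values between 0 and 1) can be sketched as a small helper. It also counts pairs of different points at distance zero. That is one possible cause of empty clusters, not something the question confirms: all-boolean data often contains duplicate rows, and if several chosen medoids are at distance zero from each other, some clusters may end up without members. The function name check_distance_matrix and the report keys are hypothetical.

```python
import numpy as np

def check_distance_matrix(D, n_clusters=8):
    """Return basic sanity checks for a precomputed distance matrix."""
    D = np.asarray(D, dtype=float)
    if D.ndim != 2 or D.shape[0] != D.shape[1]:
        raise ValueError(f"expected a square 2D matrix, got shape {D.shape}")
    off_diag = ~np.eye(D.shape[0], dtype=bool)
    # Points at distance zero from each other have identical rows in D.
    distinct_rows = int(np.unique(D, axis=0).shape[0])
    return {
        "finite": bool(np.isfinite(D).all()),
        "symmetric": bool(np.allclose(D, D.T)),
        "zero_diagonal": bool(np.allclose(np.diag(D), 0.0)),
        "in_unit_range": bool(((D >= 0) & (D <= 1)).all()),
        # Each unordered pair appears twice in a symmetric matrix.
        "zero_off_diagonal_pairs": int(np.count_nonzero(D[off_diag] == 0) // 2),
        "distinct_rows": distinct_rows,
        "enough_distinct_points": distinct_rows >= n_clusters,
    }

# Illustrative matrix in which points 0 and 1 are duplicates.
D = np.array([
    [0.0, 0.0, 0.5],
    [0.0, 0.0, 0.5],
    [0.5, 0.5, 0.0],
])
report = check_distance_matrix(D, n_clusters=3)
print(report)
```

If enough_distinct_points is False, requesting that many clusters cannot give every cluster its own distinct medoid, whatever initialization is used.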

2 Answers:

Answer 0 (score: 2)

I got this warning too (but with Euclidean distance). Using a different initialization for the cluster centers fixed it for me:

kmedoids = KMedoids(metric='precomputed', init='k-medoids++').fit(dist)

Answer 1 (score: 0)

To get the cluster labels from the trained model (i.e., the training labels):

data = pd.read_csv(path_to_data)
dist = calcDist(data)
kmedoids = KMedoids(metric='precomputed').fit(dist)
labels = kmedoids.labels_
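
Once the training labels are available, empty clusters can be confirmed by counting the members of each cluster. The sketch below uses a made-up label array in place of kmedoids.labels_; the helper name cluster_sizes is hypothetical.

```python
import numpy as np

def cluster_sizes(labels, n_clusters):
    """Number of points assigned to each cluster id in 0..n_clusters-1."""
    return np.bincount(np.asarray(labels), minlength=n_clusters)

# Stand-in for kmedoids.labels_ (illustrative values only).
labels = np.array([0, 0, 2, 0, 2])
sizes = cluster_sizes(labels, n_clusters=4)
empty = np.flatnonzero(sizes == 0)
print(sizes, empty)
```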

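To use kmedoids.predict with the trained k-medoids model on any prediction data, you need to compute the N x K distance matrix from the N prediction data points to the K medoids, correctly indexed.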

medoids = data.to_numpy()[kmedoids.medoid_indices_, :] # medoid_indices_ index the rows of the training data
distToMedoids = calcDistToMedoids(predictData, medoids) # with the same metric used in training
predict_labels = kmedoids.predict(distToMedoids)
predict_labels = np.argmin(distToMedoids, axis=1) # what .predict() does 

You can find more details in the source code.
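
The nearest-medoid assignment described in this answer can be sketched with NumPy alone. The example assumes boolean features and the simple matching distance (fraction of mismatching features). The medoids are taken from the training rows, because medoid indices refer to the data passed to fit. The function name assign_to_nearest_medoid and all data are hypothetical; the input shape that predict expects for a precomputed metric should be checked against the installed sklearn_extra version.

```python
import numpy as np

def assign_to_nearest_medoid(new_points, medoids):
    """Return the N x K distance matrix and the index of the nearest medoid."""
    new_points = np.asarray(new_points, dtype=bool)
    medoids = np.asarray(medoids, dtype=bool)
    # Fraction of mismatching features between each new point and each medoid.
    dist = (new_points[:, None, :] != medoids[None, :, :]).mean(axis=2)
    return dist, np.argmin(dist, axis=1)

# Illustrative training data; medoid_indices stands in for kmedoids.medoid_indices_.
train = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
], dtype=bool)
medoid_indices = np.array([0, 2])
medoids = train[medoid_indices]

new_points = np.array([
    [1, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=bool)
dist_to_medoids, new_labels = assign_to_nearest_medoid(new_points, medoids)
print(dist_to_medoids, new_labels)
```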