Question

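I am writing a program in which I need to apply K-means clustering to a dataset of more than 200 arrays of 300 elements each. Could someone point me to code, with an explanation, for: 1. finding k with the elbow method 2. applying k-means and getting the array of centroids

I searched for the above myself but did not find code with a clear explanation. P.S. I am working on Google Colab, so if there is a Colab-specific way to do this, please suggest it.

I tried the following code, but I keep getting this error: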

---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

TypeError: float() argument must be a string or a number, not 'list'


The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)

<ipython-input-70-68e300fd4bf8> in <module>()
     24 
     25 # step 1: find optimal k (number of clusters)
---> 26 find_best_k()
     27 

3 frames

/usr/local/lib/python3.6/dist-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
     83 
     84     """
---> 85     return array(a, dtype, copy=False, order=order)
     86 
     87 

ValueError: setting an array element with a sequence.

Answer 1

Suppose there are 12 samples, each with the following two features:

data=np.array([[1,1],[1,2],[2,1.5],[4,5],[5,6],[4,5.5],[5,5],[8,8],[8,8.5],[9,8],[8.5,9],[9,9]])

You can find the optimal number of clusters with the elbow method, and then the cluster centers, as in the following example:

import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

data=np.array([[1,1],[1,2],[2,1.5],[4,5],[5,6],[4,5.5],[5,5],[8,8],[8,8.5],[9,8],[8.5,9],[9,9]])

def find_best_k():
    sum_of_squared_distances = []
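    K=range(1,8) # candidate k values 1..7; raise the upper bound (8) for larger data
    for k in K:
        # n_init and random_state are fixed so that repeated runs give the same curve
        km=KMeans(n_clusters=k, n_init=10, random_state=0)
        km=km.fit(data)
        sum_of_squared_distances.append(km.inertia_)  # within-cluster sum of squares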
    plt.plot(K, sum_of_squared_distances, 'bx-')
    plt.xlabel('k')
    plt.ylabel('sum_of_squared_distances')
    plt.title('Elbow method for optimal k')
    plt.show()  
    #The plot looks like an arm, and the elbow on the arm is optimal k.

# step 1: find optimal k (number of clusters)
find_best_k()

def run_kmeans(k,data): # k is the optimal number of clusters
    km=KMeans(n_clusters=k) 
    km=km.fit(data)
    centroids = km.cluster_centers_  #get the center of clusters
    #print(centroids)
    return centroids

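def plotresults():
    centroids=run_kmeans(3,data)
    # Point colours follow the row order of the sample data (rows 0-2, 3-6, 7-11),
    # not the labels assigned by k-means.
    plt.plot(data[0:3,0],data[0:3,1],'ro',data[3:7,0],data[3:7,1],'bo',data[7:12,0],data[7:12,1],'go')
    for i in range(3):
        plt.plot(centroids[i,0],centroids[i,1],'k*')  # mark each centroid with a black star
        plt.text(centroids[i,0],centroids[i,1], "c"+str(i), fontsize=12)
    plt.show()

# step 2: run k-means with the chosen k and plot the centroids
plotresults()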

Elbow plot: (figure)

Result: (figure)

Hope this helps.
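On the error in the question: "setting an array element with a sequence" is commonly raised when the input is not a clean rectangular table, for example when some rows have a different length or contain a nested list. KMeans expects a 2-D array of shape (n_samples, n_features). The sketch below is one way to check this before fitting. The helper names `find_bad_rows` and `to_feature_matrix` are hypothetical, and the random vectors only stand in for the real 300-element arrays.

```python
import numpy as np

def find_bad_rows(vectors, n_features):
    """Return indices of rows that are not flat sequences of n_features scalars."""
    bad = []
    for i, v in enumerate(vectors):
        ok = (
            hasattr(v, "__len__")
            and len(v) == n_features
            and all(np.isscalar(x) for x in v)
        )
        if not ok:
            bad.append(i)
    return bad

def to_feature_matrix(vectors, n_features=300):
    """Stack the vectors into a float array of shape (n_samples, n_features)."""
    bad = find_bad_rows(vectors, n_features)
    if bad:
        raise ValueError(f"rows that are not flat length-{n_features} vectors: {bad}")
    return np.asarray(vectors, dtype=float)

# Stand-in data: 200 random vectors with 300 elements each (an assumption).
rng = np.random.default_rng(0)
vectors = [rng.normal(size=300).tolist() for _ in range(200)]
X = to_feature_matrix(vectors)
print(X.shape)  # (200, 300)
```

If `find_bad_rows` reports any indices, those rows are the first place to look. Once `X` has the expected shape it can be passed in place of `data` in the functions above.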

Answer 2

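As a complement to Roohollah's answer: note that the elbow method for finding the optimal number of clusters for K-Means is purely visual, and the result can be ambiguous. You may therefore want to combine it with silhouette analysis, as described, for example, in the following articles: Choosing the appropriate number of clusters (RealPython), Silhouette method - including an implementation example in Python (TowardsDataScience), Silhouette analysis example (Scikit-learn), Silhouette (Wikipedia).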
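A minimal sketch of that combination, assuming scikit-learn's `silhouette_score` and the sample data from the first answer. The helper name `silhouette_by_k` and the fixed seed are choices made here for illustration. The mean silhouette score lies between -1 and 1, and higher values indicate better-separated clusters, so the k with the highest score is a reasonable candidate to compare against the elbow plot.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

data = np.array([[1, 1], [1, 2], [2, 1.5], [4, 5], [5, 6], [4, 5.5], [5, 5],
                 [8, 8], [8, 8.5], [9, 8], [8.5, 9], [9, 9]])

def silhouette_by_k(X, k_values, seed=0):
    """Fit KMeans for each k and return {k: mean silhouette score}."""
    scores = {}
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=seed)
        labels = km.fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return scores

# The silhouette score needs at least 2 clusters and fewer clusters than samples.
scores = silhouette_by_k(data, range(2, 8))
best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(k, round(s, 3))
print("k with the highest silhouette score:", best_k)
```

For the 300-dimensional data in the question, the same function can be called with the real feature matrix and a wider range of k.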

K-means clustering of word vectors (300 dimensions)
