Question

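I am implementing my own k-means algorithm on a dataset. When I pick arbitrary random points from the dataset as the initial centroids, I get poor accuracy. However, when I randomly pick one centroid from each class of the data, I get good accuracy. Please help me figure out where I am going wrong. Here is my implementation: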

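First, I generate random centroids and pass them to a function that assigns each data point to a cluster according to its closest centroid: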

from collections import defaultdict
# Assumption: euclidean is SciPy's distance function; the post does not show the import
from scipy.spatial.distance import euclidean

def assignClustersKNN(features, centroids, labels):
    # Note: labels is accepted but never used inside this function
    assignments = defaultdict(list)
    # Iterating over all data points
    for point in features:
        # Distance from this point to every centroid
        distances = [euclidean(point, centroid) for centroid in centroids]
        # Getting the index of the centroid which is the closest
        clusterAssigned = distances.index(min(distances))
        # Adding the point to the closest cluster
        assignments[clusterAssigned].append(point)
    return assignments

Then I update each cluster's centroid by computing the mean of the points in that cluster:

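import numpy as np

def updateCentroids(assignments):
    # Size the array from the data instead of hard-coding 3 features, and use the
    # largest cluster index so a cluster with no points (no key) cannot cause an IndexError
    numClusters = max(assignments.keys()) + 1
    numFeatures = len(next(iter(assignments.values()))[0])
    newCentroids = np.zeros(shape=(numClusters, numFeatures))
    for i in assignments.keys():
        # Getting the data points of each cluster
        clusterMembers = assignments[i]
        # Computing the mean of the data points of the cluster
        newCentroids[i] = np.mean(clusterMembers, axis=0)
    return newCentroids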

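My stopping condition is that the centroids in one iteration do not differ from those of the previous iteration. That means the clusters have not changed, so I stop the process.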
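As a rough illustration of how the steps described above could fit together, here is a minimal sketch, not the exact code I ran. The name `run_kmeans`, the `max_iter` cap, the `seed` parameter, the use of `np.allclose` for the "centroids did not change" check, and the choice to keep the old centroid when a cluster ends up empty are all assumptions made for this example.

```python
import numpy as np

def run_kmeans(features, k, max_iter=100, seed=0):
    """Hypothetical driver: random init, then assign/update until centroids stop moving."""
    rng = np.random.default_rng(seed)
    features = np.asarray(features, dtype=float)
    # Random initialization: pick k distinct data points as the starting centroids.
    centroids = features[rng.choice(len(features), size=k, replace=False)]
    nearest = np.zeros(len(features), dtype=int)
    for _ in range(max_iter):
        # Euclidean distance from every point to every centroid, shape (n_points, k).
        diffs = features[:, None, :] - centroids[None, :, :]
        distances = np.linalg.norm(diffs, axis=2)
        # Index of the closest centroid for each point.
        nearest = distances.argmin(axis=1)
        # Update step: each centroid becomes the mean of its members.
        new_centroids = centroids.copy()
        for j in range(k):
            members = features[nearest == j]
            if len(members) > 0:  # assumption: an empty cluster keeps its old centroid
                new_centroids[j] = members.mean(axis=0)
        # Stopping condition: centroids did not move since the previous iteration.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, nearest

if __name__ == "__main__":
    points = [[0, 0], [0, 1], [10, 10], [10, 11]]
    final_centroids, cluster_ids = run_kmeans(points, k=2)
    print(final_centroids)
    print(cluster_ids)
```

Because the starting centroids are drawn at random, different values of `seed` can end in different final clusterings on harder data, which is the behaviour I am asking about.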

Random centroid selection leads to low accuracy in k-means implementation

0 answers: