Question

I am using a modified Lloyd's algorithm to get equal cluster sizes from k-means with k = 2. Here is the pseudocode:

- Randomly choose 2 points as initialization for the 2 clusters (denoted as c1, c2)
- Repeat the steps below until convergence
    - Sort all points xi by ascending value of ||xi-c1|| - ||xi-c2||, i.e. the difference between the distances to the first and the second centroid
    - Put the first 50% of the sorted points in cluster 1, the rest in cluster 2
    - Recalculate centroids as average of the allocated points (as usual in Lloyd's)
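
The steps above can be sketched in plain Python as follows. This is only a minimal sketch under some assumptions, not the asker's actual code: the names `balanced_two_means` and `mean_point`, the `max_iter` and `seed` parameters, the stopping rule (assignments no longer change), and rounding down for an odd number of points are all made up for illustration. It also records the sum of squared distances after every iteration, so the behaviour of the objective can be inspected empirically.

```python
import math
import random


def mean_point(pts):
    """Coordinate-wise mean of a non-empty list of points (hypothetical helper)."""
    return tuple(sum(coord) / len(pts) for coord in zip(*pts))


def balanced_two_means(points, max_iter=100, seed=0):
    """Balanced 2-means sketch: returns (labels, centroids, objective history)."""
    rng = random.Random(seed)
    c1, c2 = rng.sample(points, 2)   # random initialization from the data
    half = len(points) // 2          # "first 50%" (assumption: round down for odd n)
    labels, history = None, []

    for _ in range(max_iter):
        # Sort point indices by ascending ||x - c1|| - ||x - c2||.
        order = sorted(
            range(len(points)),
            key=lambda i: math.dist(points[i], c1) - math.dist(points[i], c2),
        )
        new_labels = [1] * len(points)
        for i in order[:half]:
            new_labels[i] = 0
        if new_labels == labels:     # assignments unchanged: treat as converged
            break
        labels = new_labels

        # Recompute each centroid as the mean of its assigned points.
        c1 = mean_point([p for p, l in zip(points, labels) if l == 0])
        c2 = mean_point([p for p, l in zip(points, labels) if l == 1])

        # Record the k-means objective (sum of squared distances to own centroid).
        history.append(
            sum(math.dist(p, (c1, c2)[l]) ** 2 for p, l in zip(points, labels))
        )

    return labels, (c1, c2), history
```

Calling `balanced_two_means(pts)` on a list of coordinate tuples returns a 0/1 label per point; printing the returned `history` list is a quick way to look at how the objective evolves on a given data set.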

Empirically, the above algorithm works well for me:

- It gives balanced clusters
- It always decreases the objective

Has such an algorithm been proposed or analyzed in the literature before? Could I get some references?

Answer 1

A more general version for more than 2 clusters is explained here:

https://elki-project.github.io/tutorial/same-size_k_means

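I have seen k-means with various size constraints in the literature a few times, but I don't have any references at hand. I'm not convinced by the idea, though: IMHO, forcing the clusters to have the same size contradicts the k-means idea of finding the least-squares best approximation, because it means deliberately choosing a worse approximation.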
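
In symbols, and assuming the usual sum-of-squares objective with $c_j$ the mean of cluster $C_j$ (the notation here is introduced only for illustration), the equal-size constraint shrinks the set of allowed partitions, so the best constrained value can never be lower than the best unconstrained one:

$$
\min_{C_1, C_2} \sum_{j=1}^{2} \sum_{x_i \in C_j} \lVert x_i - c_j \rVert^2
\;\le\;
\min_{\substack{C_1, C_2 \\ |C_1| = |C_2|}} \sum_{j=1}^{2} \sum_{x_i \in C_j} \lVert x_i - c_j \rVert^2
$$

The two sides are equal only when some optimal unconstrained partition happens to be balanced already.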
