I use a modified Lloyd's algorithm to get clusters of equal size from k-means with k = 2. The pseudocode is as follows:
- Randomly choose 2 points as the initial centroids of the 2 clusters (denoted c1, c2)
- Repeat the steps below until convergence:
  - Sort all points xi in ascending order of ||xi - c1|| - ||xi - c2||, i.e. the difference between the distances to the first and second centroids
  - Put the top 50% of points in cluster 1 and the rest in cluster 2
  - Recalculate the centroids as the average of the allocated points (as usual in Lloyd's)
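The steps above could be sketched in NumPy roughly as follows. This is only an illustration under some assumptions of mine: the function name `balanced_two_means`, the `max_iter` cap, the `seed` parameter, the "assignments stopped changing" convergence test, and the rule that cluster 1 gets the smaller half when the number of points is odd are not part of the pseudocode.

```python
import numpy as np

def balanced_two_means(X, max_iter=100, seed=0):
    """Equal-size 2-means sketch; X is an (n, d) array with n >= 2."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(X)
    # Randomly choose 2 distinct points as the initial centroids c1, c2.
    c = X[rng.choice(n, size=2, replace=False)]
    labels = None
    for _ in range(max_iter):
        # ||xi - c1|| - ||xi - c2|| for every point.
        diff = np.linalg.norm(X - c[0], axis=1) - np.linalg.norm(X - c[1], axis=1)
        order = np.argsort(diff, kind="stable")
        # Top half (smallest differences) goes to cluster 1 (label 0),
        # the rest to cluster 2 (label 1). Assumption: floor(n / 2) for odd n.
        new_labels = np.ones(n, dtype=int)
        new_labels[order[: n // 2]] = 0
        # Assumed convergence test: stop when the assignment no longer changes.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Recalculate centroids as the mean of the allocated points.
        c = np.stack([X[labels == 0].mean(axis=0), X[labels == 1].mean(axis=0)])
    return labels, c
```

Calling `labels, c = balanced_two_means(X)` returns a 0/1 label per point and the two centroids; the two label counts differ by at most one.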
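So far, the algorithm above works well for me empirically.
Has an algorithm like this been proposed or analyzed in the literature before? Could I get some references?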
Answer 0 (score: 2)
A more general version for more than 2 clusters is explained here:
https://elki-project.github.io/tutorial/same-size_k_means
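I have seen k-means with various size constraints several times in the literature, but I don't have any references at hand. I am not convinced by the approach: IMHO, forcing the clusters to have the same size contradicts the k-means idea of finding the least-squares optimal approximation, because it means deliberately choosing a worse approximation.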
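To make the "worse approximation" point concrete, here is a small sketch for one fixed pair of centroids. The data, the hand-picked centroids, and all variable names are made up for illustration. For fixed centroids, assigning each point to its nearest centroid minimizes the sum of squared errors, so the equal-size split can only match or exceed that value.

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative unbalanced data: 30 points near (0, 0) and 10 near (4, 4).
X = np.vstack([rng.normal(0.0, 1.0, (30, 2)), rng.normal(4.0, 1.0, (10, 2))])
c = np.array([[0.0, 0.0], [4.0, 4.0]])  # fixed centroids, chosen by hand

# Distances from every point to both centroids, shape (40, 2).
d = np.linalg.norm(X[:, None, :] - c[None, :, :], axis=2)
idx = np.arange(len(X))

# Unconstrained assignment: each point goes to its nearest centroid.
nearest = d.argmin(axis=1)
sse_nearest = (d[idx, nearest] ** 2).sum()

# Equal-size assignment: sort by distance difference, top half to cluster 1.
order = np.argsort(d[:, 0] - d[:, 1], kind="stable")
balanced = np.ones(len(X), dtype=int)
balanced[order[: len(X) // 2]] = 0
sse_balanced = (d[idx, balanced] ** 2).sum()

print(sse_nearest, sse_balanced)  # the balanced SSE is never the smaller one
```

This only compares assignments for the same centroids; it does not say how large the gap is after the centroids are re-estimated.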