Question

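I noticed that sklearn's KMeans uses imaginary points (points that are not in the data set) as cluster centroids.

So far I have not found any option in sklearn to use actual data points as centroids.

I am currently computing the data point nearest to each centroid, but I think there may be a simpler way.

I am not necessarily restricted to k-means.

My Google searches for clustering around real-data centroids have not turned up anything useful either.

Has anyone run into the same problem before?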

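import math

import numpy as np
from sklearn.cluster import KMeans

def distance(a, b):
    # Euclidean distance between two 2D points
    return math.sqrt((a[0] - b[0])**2 + (a[1] - b[1])**2)

# 10 random points in the unit square
x = np.random.rand(10)
y = np.random.rand(10)

xy = np.array((x, y)).T

kmeans = KMeans(n_clusters=2)
kmeans.fit(xy)
centroids = kmeans.cluster_centers_

# Check whether the first centroid coincides with a data point
# (typically prints an empty array, because it does not)
print(np.where(xy == centroids[0])[0])

# Current workaround: find the data point nearest to each centroid
for c in centroids:
    nearest = min(xy, key=lambda p: distance(p, c))
    print('centroid', c)
    print('nearest data point to centroid', nearest)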

Answer 1

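Actually, sklearn.cluster.KMeans now lets you pass custom initial centroids through the init parameter. See the init section at https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html or the KMeans source code: https://github.com/scikit-learn/scikit-learn/blob/b194674c4/sklearn/cluster/_kmeans.py#L649

"If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers."

I hope this works for you. Please give it a try.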
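A minimal sketch of what passing an ndarray to init might look like, assuming the first two data points are used as the starting centers (that choice is arbitrary and only for illustration). Note that init only sets the starting positions: KMeans still updates the centers to cluster means during fitting, so the final cluster_centers_ are generally not data points. The check at the end makes that visible.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
xy = rng.random((10, 2))          # 10 random 2D points

# Assumption for illustration: start from the first two data points
init_centers = xy[:2]

# n_init=1 because the starting centers are given explicitly
kmeans = KMeans(n_clusters=2, init=init_centers, n_init=1)
kmeans.fit(xy)
centroids = kmeans.cluster_centers_

# For each final centroid, check whether it coincides with a data point
is_data_point = [
    bool(np.any(np.all(np.isclose(xy, c), axis=1))) for c in centroids
]
print('final centroids:\n', centroids)
print('centroid is an actual data point:', is_data_point)
```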

Answer 2

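The centroids do not have to be points in your data set. Since you are working in 2D space, you get centroids with 2D coordinates. If you want to print the distance between each centroid and every point, you can do: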

import numpy as np
from sklearn.cluster import KMeans

# 10 random points in the unit square
x = np.random.rand(10)
y = np.random.rand(10)

xy = np.array((x, y)).T

kmeans = KMeans(n_clusters=2)
kmeans.fit(xy)
centroids = kmeans.cluster_centers_

# Euclidean distance from each centroid to every data point
for centroid in centroids:
    print(f'List of distances between centroid {centroid} and each point:\n\
          {np.linalg.norm(centroid-xy, axis=1)}\n')

List of distances between centroid [0.87236496 0.74034618] and each point:
          [0.21056113 0.84946149 0.83381298 0.31347176 0.40811323 0.85442416
 0.44043437 0.66736601 0.55282619 0.14813826]

List of distances between centroid [0.37243631 0.37851987] and each point:
          [0.77005698 0.29192851 0.25249753 0.60881231 0.2219568  0.24264077
 0.27374379 0.39968813 0.31728732 0.58604271]

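As you can see, predict assigns each point to the centroid at the smallest distance, which matches taking the argmin of the distances by hand: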

kmeans.predict(xy)
array([0, 0, 0, 0, 1, 1, 0, 1, 1, 1])


distances = np.vstack([np.linalg.norm(centroids[0]-xy, axis=1),
                     np.linalg.norm(centroids[1]-xy, axis=1)])
distances.argmin(axis=0)
array([0, 0, 0, 0, 1, 1, 0, 1, 1, 1])
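The same distance matrix can also be read the other way around: taking the argmin along the point axis gives, for each centroid, the index of the closest actual data point, which is what the question computes with a Python loop. The sketch below shows this with plain NumPy and, as an alternative, with sklearn.metrics.pairwise_distances_argmin_min. The seeded random data and the variable names are only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

rng = np.random.default_rng(0)
xy = rng.random((10, 2))          # 10 random 2D points

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(xy)
centroids = kmeans.cluster_centers_

# distances[i, j] = Euclidean distance from centroid i to point j
distances = np.linalg.norm(centroids[:, None, :] - xy[None, :, :], axis=2)

# Index of the nearest data point for each centroid
nearest_idx = distances.argmin(axis=1)
nearest_points = xy[nearest_idx]

# The same lookup using scikit-learn's helper
idx_sklearn, dist_sklearn = pairwise_distances_argmin_min(centroids, xy)

for c, p in zip(centroids, nearest_points):
    print('centroid', c, '-> nearest data point', p)
```

If real data points are needed as cluster representatives, nearest_points can be used in place of cluster_centers_ after fitting.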

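Let's plot the data: centroids are drawn as squares and points as circles, with each circle's size inversely proportional to its distance from its centroid.

Note that the plot shows a different set of random data points, but I hope it helps.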
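The plot described above could be produced roughly as follows. This is only a sketch, assuming matplotlib is available; the helper name plot_clusters, the scale factor, and the small constant added to the distances (to avoid division by zero) are illustrative choices, not part of the original answer.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def plot_clusters(xy, kmeans, scale=20.0):
    """Scatter plot: points as circles, centroids as squares (hypothetical helper)."""
    centroids = kmeans.cluster_centers_
    labels = kmeans.predict(xy)
    k = len(centroids)

    # Distance from each point to its own centroid
    dist = np.linalg.norm(xy - centroids[labels], axis=1)
    # Marker area inversely proportional to that distance
    sizes = scale / (dist + 1e-3)

    fig, ax = plt.subplots()
    ax.scatter(xy[:, 0], xy[:, 1], s=sizes, c=labels,
               marker='o', cmap='viridis', vmin=0, vmax=k - 1)
    ax.scatter(centroids[:, 0], centroids[:, 1], s=120, c=np.arange(k),
               marker='s', cmap='viridis', vmin=0, vmax=k - 1,
               edgecolors='k')
    return fig, ax

rng = np.random.default_rng(0)
xy = rng.random((10, 2))
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(xy)

fig, ax = plot_clusters(xy, kmeans)
# Call plt.show() or fig.savefig('clusters.png') to view the result
```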

Python k-means clustering with real data points as centroids
