我正在处理一个非常高维的数据集,并对其执行了k均值聚类。我试图找到每个质心最接近的20个点。数据集(X_emb)的尺寸为10 x2816。提供的代码用于查找与每个质心最接近的单个点。注释掉的代码是我发现的潜在解决方案,但是我无法使其正确工作。
import numpy as np
import pickle as pkl
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min
from sklearn.neighbors import NearestNeighbors
from visualization.make_video_v2 import make_video_from_numpy
from scipy.spatial import cKDTree
n_s_train = 10000
df = pkl.load(open('cluster_data/mixed_finetuning_data.pkl', 'rb'))
N = len(df)
X = []
X_emb = []
for i in range(N):
play = df.iloc[i]
if df.iloc[i].label == 1:
X_emb.append(play['embedding'])
X.append(play['input'])
X_emb = np.array(X_emb)
kmeans = KMeans(n_clusters=10)
kmeans.fit(X_emb)
results = kmeans.cluster_centers_
closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, X)
# def find_k_closest(centroids, data, k=1, distance_norm=2):
# kdtree = cKDTree(data, leafsize=30)
# distances, indices = kdtree.query(centroids, k, p=distance_norm)
# if k > 1:
# indices = indices[:,-1]
# values = data[indices]
# return indices, values
# indices, values = find_k_closest(results, X_emb)
答案 0 :(得分:0)
您可以通过sklearn的NearestNeighbors
类:
from sklearn.neighbors import NearestNeighbors
def find_k_closest(centroids, data):
nns = {}
neighbors = NearesNieghbors(n_neighbors=20).fit(data)
for center in centroids:
nns[center] = neighbors.kneighbors(center, return_distance=false)
return nns
nns词典应包含中心作为键,并包含邻居列表作为值
答案 1 :(得分:0)
您可以使用成对距离来计算X_emb中每个点与质心的每个点的距离,然后使用numpy查找最少20个元素的索引,最后从X_emb中获取它们
from sklearn.metrics import pairwise_distances
distances = pairwise_distances(centroids, X_emb, metric='euclidean')
ind = [np.argpartition(i, 20)[:20] for i in distances]
closest = [X_emb[indexes] for indexes in ind]
最接近的形状为(质心数x 20)