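After clustering a distance matrix with scipy.cluster.hierarchy.linkage and assigning each sample to a cluster with scipy.cluster.hierarchy.cut_tree, I would like to extract one element from each cluster: the one closest to that cluster's centroid.
I am confused by the centroid linkage rule in scipy.cluster.hierarchy.linkage. I have already done the clustering and only want to access the elements closest to the centroids.

Answer 0 (score: 2)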
Nearest neighbours are computed most efficiently with a KD-tree. E.g.:
from scipy.spatial import cKDTree

def find_k_closest(centroids, data, k=1, distance_norm=2):
    """
    Arguments:
    ----------
        centroids: (M, d) ndarray
            M - number of clusters
            d - number of data dimensions
        data: (N, d) ndarray
            N - number of data points
        k: int (default 1)
            nearest neighbour to get
        distance_norm: int (default 2)
            1: Manhattan distance (|x| + |y|)
            2: Euclidean distance (sqrt(x^2 + y^2))
            np.inf: maximum distance in any dimension (max(|x|, |y|))

    Returns:
    -------
        indices: (M,) ndarray
        values: (M, d) ndarray
    """
    # Build the tree on the data points (default leaf size)
    kdtree = cKDTree(data)
    distances, indices = kdtree.query(centroids, k, p=distance_norm)
    if k > 1:
        # query returns all k neighbours per centroid; keep only the k-th nearest
        indices = indices[:, -1]
    values = data[indices]
    return indices, values

indices, values = find_k_closest(centroids, data)
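
The function above takes `centroids` as given. As a minimal sketch of where they might come from, the example below computes each centroid as the mean of the points that `cut_tree` assigned to a cluster, then queries a KD-tree for the nearest data point. The synthetic two-blob `data`, the `average` linkage method, and `n_clusters=2` are assumptions chosen for illustration and are not part of the original question.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cut_tree
from scipy.spatial import cKDTree

# Synthetic data (assumption): two well-separated 2-D blobs of 20 points each
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(20, 2)),
    rng.normal(loc=10.0, scale=0.5, size=(20, 2)),
])

# Cluster the observations and cut the tree into two clusters
Z = linkage(data, method="average")
labels = cut_tree(Z, n_clusters=2).ravel()  # cut_tree returns an (n, 1) array

# Centroid of each cluster = mean of its member points
cluster_ids = np.unique(labels)
centroids = np.array([data[labels == c].mean(axis=0) for c in cluster_ids])

# Index of the data point nearest to each centroid (Euclidean distance)
tree = cKDTree(data)
distances, closest = tree.query(centroids, k=1)
representatives = data[closest]
```

Note that the tree is built on all points, so for oddly shaped clusters the nearest point to a centroid could in principle belong to a different cluster. Building one tree per cluster avoids that, at the cost of a little extra code.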
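
Answer 1 (score: 1)
Paul's solution works for multidimensional arrays. In the more specific case where you only have a distance matrix dm, with distances computed in a "non-trivial" way (e.g. each pair of objects is first aligned in 3D, then the RMSD is computed), I ended up selecting from each cluster the element with the smallest sum of distances to the other elements of the cluster, a.k.a. the cluster's medoid. (See the discussion under this answer.) This is how I did it, given the distance matrix dm and the list of object names in the same order, names: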
import numpy as np
import scipy.spatial.distance as spd
import scipy.cluster.hierarchy as sch

# dm is assumed to be in condensed form, since it is passed to sch.linkage.
# linkage (method name) and rmsd_cutoff (cut height) are chosen by the user.

# Square form of distance matrix
sq = spd.squareform(dm)

# Perform clustering, capture linkage object
clusters = sch.linkage(dm, method=linkage)

# Cluster assignments; cut_tree returns an (n, 1) array, so flatten it
assignments = sch.cut_tree(clusters, height=rmsd_cutoff).ravel()

# Store object names and assignments as a list of tuples
nameList = list(zip(names, assignments))

### Extract models closest to cluster centroids
medoids = {}
for label in np.unique(assignments):
    # Boolean mask selecting the members of this cluster
    mask = assignments == label

    # Index of the column with the smallest sum of distances within the cluster submatrix
    idx = np.argmin(sq[:, mask][mask, :].sum(axis=0))

    # Extract names of cluster elements from nameList
    sublist = [name for (name, cluster) in nameList if cluster == label]

    # Element closest to centroid (the medoid), stored per cluster label
    medoids[label] = sublist[idx]
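
To make the medoid selection concrete, here is a small self-contained sketch on toy data. The six 1-D `points`, the `names`, the `average` linkage method, and `n_clusters=2` are all assumptions made for illustration. In the real use case `dm` would come from the pairwise RMSD calculation instead of `pdist`.

```python
import numpy as np
import scipy.spatial.distance as spd
import scipy.cluster.hierarchy as sch

# Toy 1-D points (assumption) forming two obvious groups
points = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [13.0]])
names = ["a", "b", "c", "d", "e", "f"]

dm = spd.pdist(points)      # condensed distance matrix
sq = spd.squareform(dm)     # square form, used for submatrix lookups

Z = sch.linkage(dm, method="average")
labels = sch.cut_tree(Z, n_clusters=2).ravel()

medoids = {}
for label in np.unique(labels):
    # Indices of the members of this cluster
    members = np.flatnonzero(labels == label)
    # Pairwise distances restricted to this cluster
    sub = sq[np.ix_(members, members)]
    # Medoid = member with the smallest total distance to the other members
    medoids[int(label)] = names[members[np.argmin(sub.sum(axis=0))]]

print(medoids)
```

For this toy data the medoids are "b" (distance sum 2 within {0, 1, 2}) and "e" (distance sum 3 within {10, 11, 13}).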