Get the element closest to a cluster's centroid

Date: 2016-09-29 09:30:59

Tags: python numpy scipy scikit-learn

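After clustering a distance matrix with scipy.cluster.hierarchy.linkage and assigning each sample to a cluster with scipy.cluster.hierarchy.cut_tree, I would like to extract one element from each cluster: the one closest to that cluster's centroid.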

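  • I would be happiest if an off-the-shelf function existed for this, but failing that:
  • Some suggestions have already been made here for extracting the centroids themselves, but not the elements closest to the centroids.
  • Note that this is not to be confused with the centroid linkage rule in scipy.cluster.hierarchy.linkage. I have already carried out the clustering and just want to access the elements closest to the centroids.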

2 answers:

Answer 0 (score: 2)

Nearest neighbours are computed most efficiently with a KD-tree. E.g.:

from scipy.spatial import cKDTree

def find_k_closest(centroids, data, k=1, distance_norm=2):
    """
    Arguments:
    ----------
        centroids: (M, d) ndarray
            M - number of clusters
            d - number of data dimensions
        data: (N, d) ndarray
            N - number of data points
        k: int (default 1)
            which nearest neighbour to return (k=1 is the closest point)
        distance_norm: int (default 2)
            1: Manhattan distance (|x| + |y|)
            2: Euclidean distance (sqrt(x^2 + y^2))
            np.inf: maximum distance in any dimension (max(|x|, |y|))

    Returns:
    -------
        indices: (M,) ndarray
        values: (M, d) ndarray
    """

    # leafsize was never defined in this snippet, so rely on cKDTree's default
    kdtree = cKDTree(data)
    distances, indices = kdtree.query(centroids, k, p=distance_norm)
    if k > 1:
        indices = indices[:,-1]
    values = data[indices]
    return indices, values

indices, values = find_k_closest(centroids, data)
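
The call above assumes that `centroids` and `data` already exist. As a rough sketch of where they might come from, the example below clusters a small made-up 2-D data set, takes each cluster's centroid to be the mean of its members, and queries the KD-tree for the nearest data point. The toy data, the "average" linkage method and the choice of two clusters are assumptions for illustration only. Note also that the tree is built over all points, so for oddly shaped clusters the nearest point could in principle belong to a different cluster.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cut_tree
from scipy.spatial import cKDTree

# Toy data (hypothetical): two well-separated groups of 2-D points
data = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.2, 0.2],
                 [10.0, 10.0], [10.0, 11.0], [11.0, 10.0], [10.2, 10.2]])

# Hierarchical clustering on the raw observations, cut into two clusters
Z = linkage(data, method="average")
labels = cut_tree(Z, n_clusters=2).ravel()

# Centroid of each cluster, taken here as the mean of its members
cluster_ids = np.unique(labels)
centroids = np.array([data[labels == c].mean(axis=0) for c in cluster_ids])

# Index of the data point nearest to each centroid (Euclidean distance)
distances, closest = cKDTree(data).query(centroids, k=1)

for c, i in zip(cluster_ids, closest):
    print(f"cluster {c}: closest element is data[{i}] = {data[i]}")
```

For this data the closest elements are the points at (0.2, 0.2) and (10.2, 10.2), i.e. rows 3 and 7.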

Answer 1 (score: 1)

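Paul's solution above works nicely for multidimensional arrays. In the more specific case where you have a distance matrix dm in which the distances are computed in a "non-trivial" way (e.g. each pair of objects is first aligned in 3D, then the RMSD is computed), I ended up selecting from each cluster the element whose sum of distances to the other elements of the cluster is smallest, a.k.a. the medoid of the cluster. (See the discussion below this answer.) This is how I did it, given the distance matrix dm and the list of object names in the same order, names: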

import numpy as np
import scipy.spatial.distance as spd
import scipy.cluster.hierarchy as sch

# Assumed to be defined already: dm (condensed distance matrix), names
# (object names in the same order), linkage (name of the linkage method)
# and rmsd_cutoff (height at which the tree is cut).

# Square form of distance matrix
sq = spd.squareform(dm)
# Perform clustering, capture linkage object
clusters = sch.linkage(dm, method=linkage)
# Cluster assignments; cut_tree returns an (n, 1) array, so flatten it
assignments = sch.cut_tree(clusters, height=rmsd_cutoff).ravel()
# Store object names and assignments as a list of tuples
nameList = list(zip(names, assignments))

### Extract models closest to cluster centroids
# Loop over the cluster labels that actually occur (the original while loop
# never incremented its counter and relied on an undefined num_Clusters)
for counter in np.unique(assignments):

    # Boolean mask for extracting the submatrix of the cluster
    mask = assignments == counter

    # Take the index of the column with the smallest sum of distances from the submatrix
    idx = np.argmin(sq[mask][:, mask].sum(axis=0))

    # Extract names of cluster elements from nameList
    sublist = [name for (name, cluster) in nameList if cluster == counter]

    # Element closest to centroid (the medoid of the cluster)
    centroid = sublist[idx]
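
The same idea can be wrapped in a small helper that returns, for every cluster, the index of its medoid in the original ordering. This is only a sketch: the function name `cluster_medoids` and the toy data are made up for illustration, and it assumes a square, symmetric distance matrix plus one cluster label per object.

```python
import numpy as np

def cluster_medoids(sq, labels):
    """Return {cluster label: index of the medoid} for a square distance matrix."""
    labels = np.asarray(labels).ravel()
    medoids = {}
    for c in np.unique(labels):
        # Indices of the objects that belong to cluster c
        members = np.flatnonzero(labels == c)
        # Pairwise distances restricted to this cluster
        sub = sq[np.ix_(members, members)]
        # Medoid: the member with the smallest summed distance to the others
        # (np.argmin returns the first one in case of a tie)
        medoids[int(c)] = int(members[np.argmin(sub.sum(axis=0))])
    return medoids

# Hypothetical example: points on a line, with distances |x_i - x_j|
x = np.array([0.0, 1.0, 3.0, 10.0, 11.0])
sq = np.abs(x[:, None] - x[None, :])
labels = [0, 0, 0, 1, 1]

medoids = cluster_medoids(sq, labels)
print(medoids)  # {0: 1, 1: 3}
```

With a list of names in the same order as the matrix, `names[medoids[c]]` would then give the representative of cluster `c`.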