我的Davies-Bouldin索引的python实现是否正确?

时间:2017-12-30 18:08:56

标签: python statistics cluster-analysis metrics data-science

我正在尝试用Python计算Davies-Bouldin Index

以下是以下代码尝试重现的步骤。

5个步骤

  1. 对于每个群集,计算每个点与质心之间的欧氏距离
  2. 对于每个群集,计算这些距离的平均值
  3. 对于每对簇,计算其质心之间的欧氏距离
  4. 然后,

    1. 对于每对聚类,计算到各自质心的平均距离之和(在步骤2中计算)并将其除以它们之间的距离(在步骤3中计算)。
    2. 最后,

      1. 计算所有这些分区(=所有索引)的平均值以获得整个聚类的Davies-Bouldin索引
      2. 代码

        def daviesbouldin(X, labels, centroids):
        
            import numpy as np
            from scipy.spatial.distance import pdist, euclidean
        
            nbre_of_clusters = len(centroids) #Get the number of clusters
            distances = [[] for e in range(nbre_of_clusters)] #Store intra-cluster distances by cluster
            distances_means = [] #Store the mean of these distances
            DB_indexes = [] #Store Davies_Boulin index of each pair of cluster
            second_cluster_idx = [] #Store index of the second cluster of each pair
            first_cluster_idx = 0 #Set index of first cluster of each pair to 0
        
            # Step 1: Compute euclidean distances between each point of a cluster to their centroid
            for cluster in range(nbre_of_clusters):
                for point in range(X[labels == cluster].shape[0]):
                    distances[cluster].append(euclidean(X[labels == cluster][point], centroids[cluster]))
        
            # Step 2: Compute the mean of these distances
            for e in distances:
                distances_means.append(np.mean(e))
        
            # Step 3: Compute euclidean distances between each pair of centroid
            ctrds_distance = pdist(centroids) 
        
            # Tricky step 4: Compute Davies-Bouldin index of each pair of cluster   
            for i, e in enumerate(e for start in range(1, nbre_of_clusters) for e in range(start, nbre_of_clusters)):
                second_cluster_idx.append(e)
                if second_cluster_idx[i-1] == nbre_of_clusters - 1:
                    first_cluster_idx += 1
                DB_indexes.append((distances_means[first_cluster_idx] + distances_means[e]) / ctrds_distance[i])
        
            # Step 5: Compute the mean of all DB_indexes   
            print("DAVIES-BOULDIN Index: %.5f" % np.mean(DB_indexes)) 
        

        参数:

        • X是数据
        • labels,是由聚类算法计算的标签(即:kmeans)
        • centroids是每个群集质心的坐标(即:cluster_centers_

        另外,请注意我使用的是Python 3

        问题1 :每对质心之间的欧氏距离的计算是否正确(步骤3)?

        问题2 :我的第4步实施是否正确?

        问题3 :我是否需要规范化内部和群集间距离?

        有关第4步的进一步说明

        假设我们有10个集群。 循环应该计算每对集群的DB索引。

        在第一次迭代时:

        • 求和簇1的距离内平均值(distances_means的索引0)和簇2的距离内平均值(distances_means的索引1)
        • 将此总和除以2个集群之间的距离(ctrds_distance的索引0)

        在第二次迭代:

        • 求和簇1的距离内平均值(distances_means的索引0)和簇3的距离内平均值(distances_means的索引2)
        • 将此总和除以2个群集之间的距离(ctrds_distance的索引1)

        依旧......

        以10个集群为例,完整的迭代过程应如下所示:

        intra-cluster distance intra-cluster distance       distance between their
              of cluster:             of cluster:           centroids(storage num):
                 0           +             1            /             0
                 0           +             2            /             1
                 0           +             3            /             2
                 0           +             4            /             3
                 0           +             5            /             4
                 0           +             6            /             5
                 0           +             7            /             6
                 0           +             8            /             7
                 0           +             9            /             8
                 1           +             2            /             9
                 1           +             3            /             10
                 1           +             4            /             11
                 1           +             5            /             12
                 1           +             6            /             13
                 1           +             7            /             14
                 1           +             8            /             15
                 1           +             9            /             16
                 2           +             3            /             17
                 2           +             4            /             18
                 2           +             5            /             19
                 2           +             6            /             20
                 2           +             7            /             21
                 2           +             8            /             22
                 2           +             9            /             23
                 3           +             4            /             24
                 3           +             5            /             25
                 3           +             6            /             26
                 3           +             7            /             27
                 3           +             8            /             28
                 3           +             9            /             29
                 4           +             5            /             30
                 4           +             6            /             31
                 4           +             7            /             32
                 4           +             8            /             33
                 4           +             9            /             34
                 5           +             6            /             35
                 5           +             7            /             36
                 5           +             8            /             37
                 5           +             9            /             38
                 6           +             7            /             39
                 6           +             8            /             40
                 6           +             9            /             41
                 7           +             8            /             42
                 7           +             9            /             43
                 8           +             9            /             44
        

        这里的问题是我不太确定distances_means的索引是否与ctrds_distance的索引匹配。

        换句话说,我不确定计算出的第一个簇间距离是否对应于簇1和簇2之间的距离。计算出的第二个簇间距离对应于簇3和簇1之间的距离......依此类推,按照上述模式。

        简而言之:我担心我将群内距离分成一簇不相应的群集间距离。

        欢迎任何帮助!

4 个答案:

答案 0 :(得分:2)

以上是Davies-Bouldin指数天真实施的更短,更快的修正版本。

def DaviesBouldin(X, labels):
    n_cluster = len(np.bincount(labels))
    cluster_k = [X[labels == k] for k in range(n_cluster)]
    centroids = [np.mean(k, axis = 0) for k in cluster_k]
    variances = [np.mean([euclidean(p, centroids[i]) for p in k]) for i, k in enumerate(cluster_k)]
    db = []

    for i in range(n_cluster):
        for j in range(n_cluster):
            if j != i:
                db.append((variances[i] + variances[j]) / euclidean(centroids[i], centroids[j]))

    return(np.max(db) / n_cluster)

回答我自己的问题:

  • 初稿上的计数器(第4步)是正确但不相关
  • 无需规范化内部和群集间距离
  • 计算欧几里德距离时出现错误

请注意,您可以找到尝试改进此索引的创新方法,特别是用“Cylindrical”距离替换欧几里德距离的“New Version of Davies-Bouldin Index”。

答案 1 :(得分:1)

感谢您的代码和修订 - 真的帮助我开始了。更短,更快的版本并不完全正确。我对其进行了修正,以正确平均每个群集中最相似群集的色散分数。

有关原始算法和解释,请参阅https://www.researchgate.net/publication/224377470_A_Cluster_Separation_Measure

  

DBI是每个群集的相似性度量的平均值   最相似的集群。

def DaviesBouldin(X, labels):
    n_cluster = len(np.bincount(labels))
    cluster_k = [X[labels == k] for k in range(n_cluster)]
    centroids = [np.mean(k, axis = 0) for k in cluster_k]

    # calculate cluster dispersion
    S = [np.mean([euclidean(p, centroids[i]) for p in k]) for i, k in enumerate(cluster_k)]
    Ri = []

    for i in range(n_cluster):
        Rij = []
        # establish similarity between each cluster and all other clusters
        for j in range(n_cluster):
            if j != i:
                r = (S[i] + S[j]) / euclidean(centroids[i], centroids[j])
                Rij.append(r)
         # select Ri value of most similar cluster
         Ri.append(max(Rij)) 

    # get mean of all Ri values    
    dbi = np.mean(Ri)

    return dbi

答案 2 :(得分:0)

感谢您的实施。我只有一个问题:最后一行是否遗漏了一个分区。在最后一步中,max(db)的值应除以实现的簇数。

def DaviesBouldin(Daten, DatenLabels):
n_cluster = len(np.bincount(DatenLabels)) 
cluster_k = [Daten[DatenLabels == k] for k in range(n_cluster)] 
centroids = [np.mean(k, axis = 0) for k in cluster_k] 
variances = [np.mean([euclidean(p, centroids[i]) for p in k]) for i, k in enumerate(cluster_k)] # mittlere Entfernung zum jeweiligen Clusterzentrum
db = []

for i in range(n_cluster):
    for j in range(n_cluster):
        if j != i:
            db.append((variances[i] + variances[j]) / euclidean(centroids[i], centroids[j]) / n_cluster)
return(np.max(db))

也许我监督那个部门,因为我是Python的新手。但是在我的图形中(我在一系列聚类上进行迭代),DB.max的值在开始时非常低,之后又增加。在按照簇的数量进行缩放后,图形看起来更好(开始时的DB.max值很高,并且随着簇的数量不断增加而不断下降)。

祝你好运

答案 3 :(得分:0)

这比下面的代码快20倍,所有计算都以numpy完成。

import numpy as np
from scipy.spatial.distance import euclidean, cdist, pdist, squareform

def db_index(X, y):
    """
    Davies-Bouldin index is an internal evaluation method for
    clustering algorithms. Lower values indicate tighter clusters that 
    are better separated.
    """
    # get unique labels
    if y.ndim == 2:
        y = np.argmax(axis=1)
    uniqlbls = np.unique(y)
    n = len(uniqlbls)
    # pre-calculate centroid and sigma
    centroid_arr = np.empty((n, X.shape[1]))
    sigma_arr = np.empty((n,1))
    dbi_arr = np.empty((n,n))
    mask_arr = np.invert(np.eye(n, dtype='bool'))
    for i,k in enumerate(uniqlbls):
        Xk = X[np.where(y==k)[0],...]
        Ak = np.mean(Xk, axis=0)
        centroid_arr[i,...] = Ak
        sigma_arr[i,...] = np.mean(cdist(Xk, Ak.reshape(1,-1)))
    # compute pairwise centroid distances, make diagonal elements non-zero
    centroid_pdist_arr = squareform(pdist(centroid_arr)) + np.eye(n)
    # compute pairwise sigma sums
    sigma_psum_arr = squareform(pdist(sigma_arr, lambda u,v: u+v))
    # divide 
    dbi_arr = np.divide(sigma_psum_arr, centroid_pdist_arr)
    # get mean of max of off-diagonal elements
    dbi_arr = np.where(mask_arr, dbi_arr, 0)
    dbi = np.mean(np.max(dbi_arr, axis=1))
    return dbi


这是使用numpy 1.14,scipy 1.1.0和python 3的实现。计算速度没有太大提高,但内存占用空间应略小。

import numpy as np
from scipy.spatial.distance import euclidean, cdist, pdist, squareform

def db_index(X, y):
    """
    Davies-Bouldin index is an internal evaluation method for
    clustering algorithms. Lower values indicate tighter clusters that 
    are better separated.

    Arguments
    ----------
    X : 2D array (n_samples, embed_dim)
    Vector for each example.

    y : 1D array (n_samples,) or 2D binary array (n_samples, n_classes)
    True labels for each example.

    Returns
    ----------
    dbi : float
        Calculated Davies-Bouldin index.
    """
    # get unique labels
    if y.ndim == 2:
        y = np.argmax(axis=1)
    uniqlbls = np.unique(y)
    n = len(uniqlbls)
    # pre-calculate centroid and sigma
    centroid_arr = np.empty((n, X.shape[1]))
    sigma_arr = np.empty(n)
    for i,k in enumerate(uniqlbls):
        Xk = X[np.where(y==k)[0],...]
        Ak = np.mean(Xk, axis=0)
        centroid_arr[i,...] = Ak
        sigma_arr[i,...] = np.mean(cdist(Xk, Ak.reshape(1,-1)))
    # loop over non-duplicate cluster pairs
    dbi = 0
    for i in range(n):
        max_Rij = 0
        for j in range(n):
            if j != i:
                Rij = np.divide(sigma_arr[i] + sigma_arr[j], 
                                euclidean(centroid_arr[i,...], centroid_arr[j,...]))
                if Rij > max_Rij:
                    max_Rij = Rij
        dbi += max_Rij
    return dbi/n