1）MATLAB

Question

MATLAB有一个很好的silhouette function来帮助评估k-means的聚类数量。还有Python的Numpy / Scipy的等价物吗？

Answer 1

我在MATLAB和Python / Numpy中提供了一个示例silhouette实现（请记住，我在MATLAB中更流利）：

1）MATLAB

function s = mySilhouette(X, IDX)
    %# X  : matrix of size N-by-p, data where rows are instances
    %# IDX: vector of size N, cluster index of each instance (starting from 1)
    %# s  : vector of size N, silhouette score value of each instance

    N = size(X,1);            %# number of instances
    K = numel(unique(IDX));   %# number of clusters

    %# compute pairwise distance matrix
    D = squareform( pdist(X,'euclidean').^2 );

    %# indices belonging to each cluster
    kIndices = accumarray(IDX, 1:N, [K 1], @(x){sort(x)});

    %# compute a,b,s for each instance
    %# a(i): average distance from i to all other data within the same cluster.
    %# b(i): lowest average dist from i to the data of another single cluster
    a = zeros(N,1);
    b = zeros(N,1);
    for i=1:N
        ind = kIndices{IDX(i)}; ind = ind(ind~=i);
        a(i) = mean( D(i,ind) );
        b(i) = min( cellfun(@(ind) mean(D(i,ind)), kIndices([1:K]~=IDX(i))) );
    end
    s = (b-a) ./ max(a,b);
end

为了模拟MATLAB中silhouette函数的绘图，我们按照簇对轮廓值进行分组，在每个轮廓中进行排序，然后水平绘制条形图。 MATLAB添加NaN来分隔不同簇的条形图，我发现简单地对条形码进行颜色编码更容易：

%# sample data
load fisheriris
X = meas;
N = size(X,1);

%# cluster and compute silhouette score
K = 3;
[IDX,C] = kmeans(X, K, 'distance','sqEuclidean');
s = mySilhouette(X, IDX);

%# plot
[~,ord] = sortrows([IDX s],[1 -2]);
indices = accumarray(IDX(ord), 1:N, [K 1], @(x){sort(x)});
ytick = cellfun(@(ind) (min(ind)+max(ind))/2, indices);
ytickLabels = num2str((1:K)','%d');           %#'

h = barh(1:N, s(ord),'hist');
set(h, 'EdgeColor','none', 'CData',IDX(ord))
set(gca, 'CLim',[1 K], 'CLimMode','manual')
set(gca, 'YDir','reverse', 'YTick',ytick, 'YTickLabel',ytickLabels)
xlabel('Silhouette Value'), ylabel('Cluster')

%# compare against SILHOUETTE
figure, silhouette(X,IDX)

mySilhouette

2）Python

以下是我在Python中提出的内容：

import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.spatial.distance import pdist, squareform
from sklearn import datasets
import matplotlib.pyplot as plt
from matplotlib import cm

def silhouette(X, cIDX):
    """
    Computes the silhouette score for each instance of a clustered dataset,
    which is defined as:
        s(i) = (b(i)-a(i)) / max{a(i),b(i)}
    with:
        -1 <= s(i) <= 1

    Args:
        X    : A M-by-N array of M observations in N dimensions
        cIDX : array of len M containing cluster indices (starting from zero)

    Returns:
        s    : silhouette value of each observation
    """

    N = X.shape[0]              # number of instances
    K = len(np.unique(cIDX))    # number of clusters

    # compute pairwise distance matrix
    D = squareform(pdist(X))

    # indices belonging to each cluster
    kIndices = [np.flatnonzero(cIDX==k) for k in range(K)]

    # compute a,b,s for each instance
    a = np.zeros(N)
    b = np.zeros(N)
    for i in range(N):
        # instances in same cluster other than instance itself
        a[i] = np.mean( [D[i][ind] for ind in kIndices[cIDX[i]] if ind!=i] )
        # instances in other clusters, one cluster at a time
        b[i] = np.min( [np.mean(D[i][ind]) 
                        for k,ind in enumerate(kIndices) if cIDX[i]!=k] )
    s = (b-a)/np.maximum(a,b)

    return s

def main():
    # load Iris dataset
    data = datasets.load_iris()
    X = data['data']

    # cluster and compute silhouette score
    K = 3
    C, cIDX = kmeans2(X, K)
    s = silhouette(X, cIDX)

    # plot
    order = np.lexsort((-s,cIDX))
    indices = [np.flatnonzero(cIDX[order]==k) for k in range(K)]
    ytick = [(np.max(ind)+np.min(ind))/2 for ind in indices]
    ytickLabels = ["%d" % x for x in range(K)]
    cmap = cm.jet( np.linspace(0,1,K) ).tolist()
    clr = [cmap[i] for i in cIDX[order]]

    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.barh(range(X.shape[0]), s[order], height=1.0, 
            edgecolor='none', color=clr)
    ax.set_ylim(ax.get_ylim()[::-1])
    plt.yticks(ytick, ytickLabels)
    plt.xlabel('Silhouette Value')
    plt.ylabel('Cluster')
    plt.show()

if __name__ == '__main__':
    main()

python_mySilhouette

更新

正如其他人所指出的那样，scikit-learn从此加入了自己的silhouette metric implementation。要在上面的代码中使用它，请将对自定义silhouette函数的调用替换为：

from sklearn.metrics import silhouette_samples

...

#s = silhouette(X, cIDX)
s = silhouette_samples(X, cIDX)    # <-- scikit-learn function

...

其余的代码仍可按原样用于生成完全相同的图。

Answer 2

我看了，但是我找不到numpy / scipy的剪影功能，我甚至看了pylab和matplotlib。我想你必须自己实施它。

我可以指向http://orange.biolab.si/trac/browser/trunk/orange/orngClustering.py?rev=7462。它有一些实现轮廓功能的功能。

希望这有帮助。

Answer 3

这有点晚了，但值得一提的是，scikits-learn现在实现了一个轮廓功能。请参阅their documentation page或直接查看source code。

相当于Matlab的集群质量函数？

3 个答案:

1）MATLAB

2）Python

更新