MATLAB有一个很好的silhouette function来帮助评估k-means的聚类数量。还有Python的Numpy / Scipy的等价物吗?
答案 0 :(得分:16)
我在MATLAB和Python / Numpy中提供了一个示例silhouette实现(请记住,我在MATLAB中更流利):
function s = mySilhouette(X, IDX)
%# X : matrix of size N-by-p, data where rows are instances
%# IDX: vector of size N, cluster index of each instance (starting from 1)
%# s : vector of size N, silhouette score value of each instance
N = size(X,1); %# number of instances
K = numel(unique(IDX)); %# number of clusters
%# compute pairwise distance matrix
D = squareform( pdist(X,'euclidean').^2 );
%# indices belonging to each cluster
kIndices = accumarray(IDX, 1:N, [K 1], @(x){sort(x)});
%# compute a,b,s for each instance
%# a(i): average distance from i to all other data within the same cluster.
%# b(i): lowest average dist from i to the data of another single cluster
a = zeros(N,1);
b = zeros(N,1);
for i=1:N
ind = kIndices{IDX(i)}; ind = ind(ind~=i);
a(i) = mean( D(i,ind) );
b(i) = min( cellfun(@(ind) mean(D(i,ind)), kIndices([1:K]~=IDX(i))) );
end
s = (b-a) ./ max(a,b);
end
为了模拟MATLAB中silhouette函数的绘图,我们按照簇对轮廓值进行分组,在每个轮廓中进行排序,然后水平绘制条形图。 MATLAB添加NaN
来分隔不同簇的条形图,我发现简单地对条形码进行颜色编码更容易:
%# sample data
load fisheriris
X = meas;
N = size(X,1);
%# cluster and compute silhouette score
K = 3;
[IDX,C] = kmeans(X, K, 'distance','sqEuclidean');
s = mySilhouette(X, IDX);
%# plot
[~,ord] = sortrows([IDX s],[1 -2]);
indices = accumarray(IDX(ord), 1:N, [K 1], @(x){sort(x)});
ytick = cellfun(@(ind) (min(ind)+max(ind))/2, indices);
ytickLabels = num2str((1:K)','%d'); %#'
h = barh(1:N, s(ord),'hist');
set(h, 'EdgeColor','none', 'CData',IDX(ord))
set(gca, 'CLim',[1 K], 'CLimMode','manual')
set(gca, 'YDir','reverse', 'YTick',ytick, 'YTickLabel',ytickLabels)
xlabel('Silhouette Value'), ylabel('Cluster')
%# compare against SILHOUETTE
figure, silhouette(X,IDX)
以下是我在Python中提出的内容:
import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.spatial.distance import pdist, squareform
from sklearn import datasets
import matplotlib.pyplot as plt
from matplotlib import cm
def silhouette(X, cIDX):
"""
Computes the silhouette score for each instance of a clustered dataset,
which is defined as:
s(i) = (b(i)-a(i)) / max{a(i),b(i)}
with:
-1 <= s(i) <= 1
Args:
X : A M-by-N array of M observations in N dimensions
cIDX : array of len M containing cluster indices (starting from zero)
Returns:
s : silhouette value of each observation
"""
N = X.shape[0] # number of instances
K = len(np.unique(cIDX)) # number of clusters
# compute pairwise distance matrix
D = squareform(pdist(X))
# indices belonging to each cluster
kIndices = [np.flatnonzero(cIDX==k) for k in range(K)]
# compute a,b,s for each instance
a = np.zeros(N)
b = np.zeros(N)
for i in range(N):
# instances in same cluster other than instance itself
a[i] = np.mean( [D[i][ind] for ind in kIndices[cIDX[i]] if ind!=i] )
# instances in other clusters, one cluster at a time
b[i] = np.min( [np.mean(D[i][ind])
for k,ind in enumerate(kIndices) if cIDX[i]!=k] )
s = (b-a)/np.maximum(a,b)
return s
def main():
# load Iris dataset
data = datasets.load_iris()
X = data['data']
# cluster and compute silhouette score
K = 3
C, cIDX = kmeans2(X, K)
s = silhouette(X, cIDX)
# plot
order = np.lexsort((-s,cIDX))
indices = [np.flatnonzero(cIDX[order]==k) for k in range(K)]
ytick = [(np.max(ind)+np.min(ind))/2 for ind in indices]
ytickLabels = ["%d" % x for x in range(K)]
cmap = cm.jet( np.linspace(0,1,K) ).tolist()
clr = [cmap[i] for i in cIDX[order]]
fig = plt.figure()
ax = fig.add_subplot(111)
ax.barh(range(X.shape[0]), s[order], height=1.0,
edgecolor='none', color=clr)
ax.set_ylim(ax.get_ylim()[::-1])
plt.yticks(ytick, ytickLabels)
plt.xlabel('Silhouette Value')
plt.ylabel('Cluster')
plt.show()
if __name__ == '__main__':
main()
正如其他人所指出的那样,scikit-learn从此加入了自己的silhouette metric implementation。要在上面的代码中使用它,请将对自定义silhouette
函数的调用替换为:
from sklearn.metrics import silhouette_samples
...
#s = silhouette(X, cIDX)
s = silhouette_samples(X, cIDX) # <-- scikit-learn function
...
其余的代码仍可按原样用于生成完全相同的图。
答案 1 :(得分:0)
我看了,但是我找不到numpy / scipy的剪影功能,我甚至看了pylab和matplotlib。我想你必须自己实施它。
我可以指向http://orange.biolab.si/trac/browser/trunk/orange/orngClustering.py?rev=7462。它有一些实现轮廓功能的功能。
希望这有帮助。
答案 2 :(得分:0)
这有点晚了,但值得一提的是,scikits-learn现在实现了一个轮廓功能。请参阅their documentation page或直接查看source code。