在Wikipedia page上,描述了用于确定k均值中的聚类数的肘方法。 The built-in method of scipy提供了一个实现,但我不确定我是否理解它们所称的失真是如何计算的。
更确切地说,如果您绘制由方法解释的方差百分比 群集将反对群集的数量,第一个群集将 添加很多信息(解释很多变化),但在某些时候 边际增益将下降,在图中给出一个角度。
假设我的相关质心有以下几点,那么计算这个量度的好方法是什么?
points = numpy.array([[ 0, 0],
[ 0, 1],
[ 0, -1],
[ 1, 0],
[-1, 0],
[ 9, 9],
[ 9, 10],
[ 9, 8],
[10, 9],
[10, 8]])
kmeans(pp,2)
(array([[9, 8],
[0, 0]]), 0.9414213562373096)
我特别关注只计算点和质心的0.94 ..计算。我不确定是否可以使用任何内置的scipy方法,或者我必须编写自己的方法。关于如何有效地为大量积分做这些的任何建议?
简而言之,我的问题(所有相关的)如下:
编辑2:失真
from scipy.spatial.distance import cdist
D = cdist(points, centroids, 'euclidean')
sum(numpy.min(D, axis=1))
第一组点的输出是准确的。但是,当我尝试不同的设置时:
>>> pp = numpy.array([[1,2], [2,1], [2,2], [1,3], [6,7], [6,5], [7,8], [8,8]])
>>> kmeans(pp, 2)
(array([[6, 7],
[1, 2]]), 1.1330618877807475)
>>> centroids = numpy.array([[6,7], [1,2]])
>>> D = cdist(points, centroids, 'euclidean')
>>> sum(numpy.min(D, axis=1))
9.0644951022459797
我猜最后一个值不匹配,因为kmeans
似乎是将数值乘以数据集中的总点数。
编辑1:差异百分比
到目前为止我的代码(应该添加到Denis的K-means实现中):
centres, xtoc, dist = kmeanssample( points, 2, nsample=2,
delta=kmdelta, maxiter=kmiter, metric=metric, verbose=0 )
print "Unique clusters: ", set(xtoc)
print ""
cluster_vars = []
for cluster in set(xtoc):
print "Cluster: ", cluster
truthcondition = ([x == cluster for x in xtoc])
distances_inside_cluster = (truthcondition * dist)
indices = [i for i,x in enumerate(truthcondition) if x == True]
final_distances = [distances_inside_cluster[k] for k in indices]
print final_distances
print np.array(final_distances).var()
cluster_vars.append(np.array(final_distances).var())
print ""
print "Sum of variances: ", sum(cluster_vars)
print "Total Variance: ", points.var()
print "Percent: ", (100 * sum(cluster_vars) / points.var())
以下是k = 2的输出:
Unique clusters: set([0, 1])
Cluster: 0
[1.0, 2.0, 0.0, 1.4142135623730951, 1.0]
0.427451660041
Cluster: 1
[0.0, 1.0, 1.0, 1.0, 1.0]
0.16
Sum of variances: 0.587451660041
Total Variance: 21.1475
Percent: 2.77787757437
在我的真实数据集上(看起来不对我!):
Sum of variances: 0.0188124746402
Total Variance: 0.00313754329764
Percent: 599.592510943
Unique clusters: set([0, 1, 2, 3])
Sum of variances: 0.0255808508714
Total Variance: 0.00313754329764
Percent: 815.314672809
Unique clusters: set([0, 1, 2, 3, 4])
Sum of variances: 0.0588210052519
Total Variance: 0.00313754329764
Percent: 1874.74720416
Unique clusters: set([0, 1, 2, 3, 4, 5])
Sum of variances: 0.0672406353655
Total Variance: 0.00313754329764
Percent: 2143.09824556
Unique clusters: set([0, 1, 2, 3, 4, 5, 6])
Sum of variances: 0.0646291452839
Total Variance: 0.00313754329764
Percent: 2059.86465055
Unique clusters: set([0, 1, 2, 3, 4, 5, 6, 7])
Sum of variances: 0.0817517362176
Total Variance: 0.00313754329764
Percent: 2605.5970695
Unique clusters: set([0, 1, 2, 3, 4, 5, 6, 7, 8])
Sum of variances: 0.0912820650486
Total Variance: 0.00313754329764
Percent: 2909.34837831
Unique clusters: set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Sum of variances: 0.102119601368
Total Variance: 0.00313754329764
Percent: 3254.76309585
Unique clusters: set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
Sum of variances: 0.125549475536
Total Variance: 0.00313754329764
Percent: 4001.52168834
Unique clusters: set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
Sum of variances: 0.138469402779
Total Variance: 0.00313754329764
Percent: 4413.30651542
Unique clusters: set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
答案 0 :(得分:47)
就Kmeans而言,失真被用作停止标准(如果两次迭代之间的变化小于某个阈值,我们假设收敛)
如果你想从一组点和质心计算它,你可以执行以下操作(代码在MATLAB中使用pdist2
函数,但在Python / Numpy / Scipy中重写应该很简单):
% data
X = [0 1 ; 0 -1 ; 1 0 ; -1 0 ; 9 9 ; 9 10 ; 9 8 ; 10 9 ; 10 8];
% centroids
C = [9 8 ; 0 0];
% euclidean distance from each point to each cluster centroid
D = pdist2(X, C, 'euclidean');
% find closest centroid to each point, and the corresponding distance
[distortions,idx] = min(D,[],2);
结果:
% total distortion
>> sum(distortions)
ans =
9.4142135623731
我有时间玩这个..以下是应用于'Fisher Iris Dataset'的KMeans群集示例(4个功能,150个实例)。我们迭代k=1..10
,绘制肘曲线,选择K=3
作为聚类数,并显示结果的散点图。
请注意,在给定点和质心的情况下,我提供了许多计算群内方差(扭曲)的方法。 scipy.cluster.vq.kmeans
函数默认返回此度量(使用Euclidean作为距离度量计算)。您还可以使用scipy.spatial.distance.cdist
函数计算与您选择的函数的距离(假设您使用相同的距离度量获得了聚类质心:@Denis有解决方案),然后计算失真从那起。
import numpy as np
from scipy.cluster.vq import kmeans,vq
from scipy.spatial.distance import cdist
import matplotlib.pyplot as plt
# load the iris dataset
fName = 'C:\\Python27\\Lib\\site-packages\\scipy\\spatial\\tests\\data\\iris.txt'
fp = open(fName)
X = np.loadtxt(fp)
fp.close()
##### cluster data into K=1..10 clusters #####
K = range(1,10)
# scipy.cluster.vq.kmeans
KM = [kmeans(X,k) for k in K]
centroids = [cent for (cent,var) in KM] # cluster centroids
#avgWithinSS = [var for (cent,var) in KM] # mean within-cluster sum of squares
# alternative: scipy.cluster.vq.vq
#Z = [vq(X,cent) for cent in centroids]
#avgWithinSS = [sum(dist)/X.shape[0] for (cIdx,dist) in Z]
# alternative: scipy.spatial.distance.cdist
D_k = [cdist(X, cent, 'euclidean') for cent in centroids]
cIdx = [np.argmin(D,axis=1) for D in D_k]
dist = [np.min(D,axis=1) for D in D_k]
avgWithinSS = [sum(d)/X.shape[0] for d in dist]
##### plot ###
kIdx = 2
# elbow curve
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(K, avgWithinSS, 'b*-')
ax.plot(K[kIdx], avgWithinSS[kIdx], marker='o', markersize=12,
markeredgewidth=2, markeredgecolor='r', markerfacecolor='None')
plt.grid(True)
plt.xlabel('Number of clusters')
plt.ylabel('Average within-cluster sum of squares')
plt.title('Elbow for KMeans clustering')
# scatter plot
fig = plt.figure()
ax = fig.add_subplot(111)
#ax.scatter(X[:,2],X[:,1], s=30, c=cIdx[k])
clr = ['b','g','r','c','m','y','k']
for i in range(K[kIdx]):
ind = (cIdx[kIdx]==i)
ax.scatter(X[ind,2],X[ind,1], s=30, c=clr[i], label='Cluster %d'%i)
plt.xlabel('Petal Length')
plt.ylabel('Sepal Width')
plt.title('Iris Dataset, KMeans clustering with K=%d' % K[kIdx])
plt.legend()
plt.show()
在回应评论时,我在下面使用NIST hand-written digits dataset给出了另一个完整的例子:它有1797个0到9的数字图像,每个图像大小为8×8像素。我重复上面的实验略微修改:Principal Components Analysis用于将维度从64降低到2:
import numpy as np
from scipy.cluster.vq import kmeans
from scipy.spatial.distance import cdist,pdist
from sklearn import datasets
from sklearn.decomposition import RandomizedPCA
from matplotlib import pyplot as plt
from matplotlib import cm
##### data #####
# load digits dataset
data = datasets.load_digits()
t = data['target']
# perform PCA dimensionality reduction
pca = RandomizedPCA(n_components=2).fit(data['data'])
X = pca.transform(data['data'])
##### cluster data into K=1..20 clusters #####
K_MAX = 20
KK = range(1,K_MAX+1)
KM = [kmeans(X,k) for k in KK]
centroids = [cent for (cent,var) in KM]
D_k = [cdist(X, cent, 'euclidean') for cent in centroids]
cIdx = [np.argmin(D,axis=1) for D in D_k]
dist = [np.min(D,axis=1) for D in D_k]
tot_withinss = [sum(d**2) for d in dist] # Total within-cluster sum of squares
totss = sum(pdist(X)**2)/X.shape[0] # The total sum of squares
betweenss = totss - tot_withinss # The between-cluster sum of squares
##### plots #####
kIdx = 9 # K=10
clr = cm.spectral( np.linspace(0,1,10) ).tolist()
mrk = 'os^p<dvh8>+x.'
# elbow curve
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(KK, betweenss/totss*100, 'b*-')
ax.plot(KK[kIdx], betweenss[kIdx]/totss*100, marker='o', markersize=12,
markeredgewidth=2, markeredgecolor='r', markerfacecolor='None')
ax.set_ylim((0,100))
plt.grid(True)
plt.xlabel('Number of clusters')
plt.ylabel('Percentage of variance explained (%)')
plt.title('Elbow for KMeans clustering')
# show centroids for K=10 clusters
plt.figure()
for i in range(kIdx+1):
img = pca.inverse_transform(centroids[kIdx][i]).reshape(8,8)
ax = plt.subplot(3,4,i+1)
ax.set_xticks([])
ax.set_yticks([])
plt.imshow(img, cmap=cm.gray)
plt.title( 'Cluster %d' % i )
# compare K=10 clustering vs. actual digits (PCA projections)
fig = plt.figure()
ax = fig.add_subplot(121)
for i in range(10):
ind = (t==i)
ax.scatter(X[ind,0],X[ind,1], s=35, c=clr[i], marker=mrk[i], label='%d'%i)
plt.legend()
plt.title('Actual Digits')
ax = fig.add_subplot(122)
for i in range(kIdx+1):
ind = (cIdx[kIdx]==i)
ax.scatter(X[ind,0],X[ind,1], s=35, c=clr[i], marker=mrk[i], label='C%d'%i)
plt.legend()
plt.title('K=%d clusters'%KK[kIdx])
plt.show()
您可以看到某些群集实际上如何与可区分的数字相对应,而其他群集与单个数字不匹配。
注意:K-means中包含scikit-learn
的实现(以及许多其他群集算法和各种clustering metrics)。 Here是另一个类似的例子。
答案 1 :(得分:6)
一个简单的集群措施:
1)从每个点到最近的聚类中心绘制“旭日”光线,
2)查看所有光线的长度 - 距离(点,中心,公制= ......)。
对于metric="sqeuclidean"
和1个群集,
平均长度平方是总方差X.var()
;对于2个集群,它更少......到N个集群,长度全部为0。
“解释的方差百分比”是100% - 这个平均值。
此代码,在is-it-possible-to-specify-your-own-distance-function-using-scikits-learn-k-means下:
def distancestocentres( X, centres, metric="euclidean", p=2 ):
""" all distances X -> nearest centre, any metric
euclidean2 (~ withinss) is more sensitive to outliers,
cityblock (manhattan, L1) less sensitive
"""
D = cdist( X, centres, metric=metric, p=p ) # |X| x |centres|
return D.min(axis=1) # all the distances
与任何长数字列表一样,可以通过各种方式查看这些距离:np.mean(),np.histogram()...绘图,可视化并不容易。
另见stats.stackexchange.com/questions/tagged/clustering,特别是
How to tell if data is “clustered” enough for clustering algorithms to produce meaningful results?