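I'm trying to implement the K-means algorithm in Python (I know there are libraries for this, but I want to learn how to implement it myself). Here is the function I'm having trouble with: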
import numpy as np

def AssignPoints(points, centroids):
    """
    Takes two arguments:
    points is a numpy array such that points.shape = (m, n), where m is the number of examples
    and n is the number of dimensions.
    centroids is a numpy array such that centroids.shape = (k, n), where k is the number of centroids.
    k < m should hold.
    Returns:
    numpy array A such that A.shape = (m,) and A[i] is the index of the centroid which points[i] is assigned to.
    """
    m, n = points.shape
    temp = []
    for i in range(n):
        # (m, k) array of coordinate differences along dimension i
        temp.append(np.subtract.outer(points[:, i], centroids[:, i]))
    distances = np.hypot(*temp)  # this is the line that raises for 4-dimensional input
    return distances.argmin(axis=1)
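The purpose of this function: given m points and k centroids in n-dimensional space, produce a numpy array (x1 x2 x3 x4 ... xm) where x1 is the index of the centroid closest to the first point, x2 the index of the centroid closest to the second point, and so on. This worked fine until I tried it with 4-dimensional examples. With 4-dimensional input I get this error: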
File "/path/to/the/kmeans.py", line 28, in AssignPoints
distances = np.hypot(*temp)
ValueError: invalid number of arguments
How can I fix this or, if I can't, how would you suggest I compute what I'm trying to compute here?
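def AssignPoints(points, centroids):
    m, n = points.shape
    temp = []
    for i in range(n):
        # (m, k) array of coordinate differences along dimension i
        temp.append(np.subtract.outer(points[:, i], centroids[:, i]))
    for i in range(len(temp)):
        temp[i] = temp[i] ** 2
    # Sum the squared differences over all dimensions, then take the square root.
    distances = np.add.reduce(temp) ** 0.5
    return distances.argmin(axis=1)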
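A note on the error itself: np.hypot is a two-input ufunc (np.hypot.nin is 2), so unpacking four arrays into it cannot work. As far as I can tell, a third positional array is taken as the out argument, so 3-dimensional input would not raise but would silently ignore one coordinate. Below is a minimal sketch of one way to keep the hypot approach for any n, assuming Python 3 and folding the per-dimension differences pairwise with functools.reduce. The name assign_points_hypot and the sample data are made up for illustration.

```python
from functools import reduce

import numpy as np


def assign_points_hypot(points, centroids):
    """Hypothetical variant of AssignPoints that accepts any n >= 1."""
    n = points.shape[1]
    # One (m, k) array of coordinate differences per dimension.
    diffs = [np.subtract.outer(points[:, i], centroids[:, i]) for i in range(n)]
    # hypot(hypot(a, b), c) == sqrt(a**2 + b**2 + c**2), so folding pairwise
    # yields the full Euclidean distance. np.abs only matters when n == 1,
    # where reduce returns the raw (possibly negative) differences.
    distances = np.abs(reduce(np.hypot, diffs))
    return distances.argmin(axis=1)


# Made-up 4-dimensional sample: 3 points, 2 centroids.
points = np.array([[0.0, 0.0, 0.0, 0.0],
                   [1.0, 1.0, 1.0, 1.0],
                   [0.9, 1.1, 1.0, 0.8]])
centroids = np.array([[0.0, 0.0, 0.0, 0.1],
                      [1.0, 1.0, 1.0, 1.0]])
labels = assign_points_hypot(points, centroids)
```

The sum-of-squares version above is probably the simpler choice; the fold is only worth it if you specifically want hypot's protection against overflow in intermediate squares.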
Answer 0 (score: 3)
Try this:
np.sqrt(((points[np.newaxis] - centroids[:,np.newaxis]) ** 2).sum(axis=2)).argmin(axis=0)
Or:
diff = points[np.newaxis] - centroids[:,np.newaxis]
norm = np.sqrt((diff*diff).sum(axis=2))
closest = norm.argmin(axis=0)
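Don't ask what it's doing :D
Edit: nah, just kidding. The broadcasting in the middle (points[np.newaxis] - centroids[:,np.newaxis]) "makes" two 3D arrays from the original ones, with shapes (1, m, n) and (k, 1, n). The result has shape (k, m, n), and each "plane" contains the differences between all the points and one of the centroids. Let's call it diff.
Then we do the usual operation to calculate the Euclidean distance (the square root of the sum of the squared differences): np.sqrt((diff ** 2).sum(axis=2)). We end up with a (k, m) matrix where row 0 contains the distances to centroids[0], and so on. So .argmin(axis=0) gives you the result you wanted.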
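To make the shapes concrete, here is a small sketch that runs the same steps on made-up data (m = 3 points, k = 2 centroids, n = 4 dimensions; the values are purely illustrative) and prints the shape at each stage:

```python
import numpy as np

# Made-up data: m = 3 points, k = 2 centroids, n = 4 dimensions.
points = np.array([[0.0, 0.0, 0.0, 0.0],
                   [1.0, 1.0, 1.0, 1.0],
                   [0.9, 1.1, 1.0, 0.8]])
centroids = np.array([[0.0, 0.0, 0.0, 0.1],
                      [1.0, 1.0, 1.0, 1.0]])

a = points[np.newaxis]         # shape (1, m, n)
b = centroids[:, np.newaxis]   # shape (k, 1, n)
diff = a - b                   # broadcast to shape (k, m, n)
norm = np.sqrt((diff * diff).sum(axis=2))  # shape (k, m)
closest = norm.argmin(axis=0)  # shape (m,)

print(a.shape, b.shape, diff.shape, norm.shape, closest.shape)
print(closest)
```

Note that the intermediate array holds k * m * n values, so for large inputs a loop over centroids may use less memory.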
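Answer 1 (score: 0)
You need to define the distance function where you are using hypot. In K-means it is usually distance = sum((point - centroid)^2). Here is some MATLAB code that does it. I can port it if you can't, but give it a go first. Like you said, it's the only way to learn.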
function idx = findClosestCentroids(X, centroids)
%FINDCLOSESTCENTROIDS computes the centroid memberships for every example
%   idx = FINDCLOSESTCENTROIDS (X, centroids) returns the closest centroids
%   in idx for a dataset X where each row is a single example. idx = m x 1
%   vector of centroid assignments (i.e. each entry in range [1..K])
%

% Set K
K = size(centroids, 1);
[numberOfExamples numberOfDimensions] = size(X);

% You need to return the following variables correctly.
idx = zeros(size(X,1), 1);

% Go over every example, find its closest centroid, and store
% the index inside idx at the appropriate location.
% Concretely, idx(i) should contain the index of the centroid
% closest to example i. Hence, it should be a value in the
% range 1..K
%
for loop=1:numberOfExamples
    Distance = sum(bsxfun(@minus,X(loop,:),centroids).^2,2);
    [value index] = min(Distance);
    idx(loop) = index;
end;

end
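Update
This should return the distances. Note that the MATLAB code above only keeps the distance (and index) of the closest centroid and returns just the index, whereas your function computes all the distances, as does the one below.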
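import numpy as np

def FindDistance(X, centroids):
    # Returns an (examples, K) array of squared Euclidean distances.
    K = centroids.shape[0]
    examples, dimensions = X.shape
    distance = np.zeros((examples, K))
    for ex in range(examples):
        # Broadcast one example against all K centroids, sum over dimensions.
        distance[ex, :] = np.sum((X[ex, :] - centroids) ** 2, axis=1)
    return distance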
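Because the square root is monotonic, the squared distances already give the same nearest-centroid assignment once you take argmin along axis 1, so the sqrt can be skipped for this step. Here is a hedged sketch of that last step, using a loop-free variant of the distance computation for comparison. The name squared_distances and the sample data are hypothetical and not part of the answer above.

```python
import numpy as np


def squared_distances(X, centroids):
    """Return an (examples, K) array of squared Euclidean distances."""
    # (examples, 1, n) - (1, K, n) broadcasts to (examples, K, n).
    diff = X[:, np.newaxis, :] - centroids[np.newaxis, :, :]
    return (diff ** 2).sum(axis=2)


# Made-up sample: 3 examples, 2 centroids, 4 dimensions.
X = np.array([[0.0, 0.0, 0.0, 0.0],
              [1.0, 1.0, 1.0, 1.0],
              [0.9, 1.1, 1.0, 0.8]])
centroids = np.array([[0.0, 0.0, 0.0, 0.1],
                      [1.0, 1.0, 1.0, 1.0]])

d2 = squared_distances(X, centroids)
# argmin over squared distances matches argmin over true distances.
idx = d2.argmin(axis=1)
```

Unlike the MATLAB version, the indices here are 0-based.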