Question

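I have a function that updates the centroid (mean) in a K-means algorithm. I ran a profiler and found that this function takes up a large share of the computing time.

It looks like this: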

def updateCentroid(self, label):
    X=[]; Y=[]
    for point in self.clusters[label].points:
        X.append(point.x)
        Y.append(point.y)
    self.clusters[label].centroid.x = numpy.mean(X)
    self.clusters[label].centroid.y = numpy.mean(Y)

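So I'm wondering: is there a more efficient way to compute the mean of these points? If not, is there a more elegant way to formulate it? ;)

Edit:

Thanks for all the great responses! I was thinking that perhaps I could compute the mean cumulatively, using something like:

x_bar(t) = ((t-1)/t) * x_bar(t-1) + x(t)/t

where x_bar(t) is the new mean, x_bar(t-1) is the old mean, and x(t) is the newly added point.

This would yield a function similar to this: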

def updateCentroid(self, label):
    cluster = self.clusters[label]
    n = len(cluster.points)
    cluster.centroid.x *= (n-1) / n
    cluster.centroid.x += cluster.points[n-1].x / n
    cluster.centroid.y *= (n-1) / n
    cluster.centroid.y += cluster.points[n-1].y / n

It doesn't really work, but do you think it could work with some tweaking?

Answer 1

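The K-means algorithm is already implemented in scipy.cluster.vq. If you are trying to change something about that implementation, I suggest starting by studying the code there: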

In [62]: import scipy.cluster.vq as scv
In [64]: scv.__file__
Out[64]: '/usr/lib/python2.6/dist-packages/scipy/cluster/vq.pyc'

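PS. Because the algorithm you posted keeps the data behind a dict (self.clusters) and an attribute lookup (.points), you are forced to use slow Python loops to get at the data. A major speed gain could be achieved by sticking with numpy arrays. See the scipy implementation of k-means clustering for ideas on a better data structure.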
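As a rough illustration of the array-based approach, here is a minimal sketch that assumes all points live in one (N, 2) array and cluster membership is a separate integer label array. The function name update_centroids and this data layout are assumptions for illustration, not something taken from the question or from scipy.

```python
import numpy as np

def update_centroids(points, labels, k):
    """Return a (k, d) array of per-cluster means.

    points: (N, d) float array; labels: (N,) int array with values in [0, k).
    Both the layout and the function name are hypothetical.
    """
    counts = np.bincount(labels, minlength=k)   # number of points per cluster
    sums = np.zeros((k, points.shape[1]))
    np.add.at(sums, labels, points)             # add each point into its cluster's row
    # Empty clusters divide 0 by 0 and come out as NaN; handle them as needed.
    with np.errstate(invalid="ignore", divide="ignore"):
        return sums / counts[:, None]
```

All centroids are updated in one call, with no Python-level loop over individual points.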

Answer 2

Why not avoid constructing the extra arrays?

def updateCentroid(self, label):
  points = self.clusters[label].points
  sumX = 0.0; sumY = 0.0  # floats, so the division below is not truncated under Python 2
  N = len(points)
  for point in points:
    sumX += point.x
    sumY += point.y
  self.clusters[label].centroid.x = sumX/N
  self.clusters[label].centroid.y = sumY/N

Answer 3

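The expensive part of your function is almost certainly the iteration over the points. Avoid it altogether by making self.clusters[label].points a numpy array, and then compute the mean directly on it. For example, if points holds the X and Y coordinates interleaved in a one-dimensional array (x0, y0, x1, y1, ...):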

points = self.clusters[label].points
x_mean = numpy.mean(points[0::2])
y_mean = numpy.mean(points[1::2])
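To make the slicing concrete, here is a small self-contained sketch with made-up coordinates. It shows the interleaved layout described above and, as an alternative assumption, an (N, 2) layout where a single mean over axis 0 gives both coordinates at once.

```python
import numpy as np

# Layout 1 (as above): interleaved 1-D array [x0, y0, x1, y1, ...]
flat = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
x_mean = flat[0::2].mean()    # averages 0, 2, 4
y_mean = flat[1::2].mean()    # averages 1, 3, 5

# Layout 2 (an assumption, not from the answer): one row per point
pts = flat.reshape(-1, 2)     # [[0, 1], [2, 3], [4, 5]]
centroid = pts.mean(axis=0)   # array holding [x_mean, y_mean]
```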

Answer 4

Without extra lists:

def updateCentroid(self, label):
    # The generator expression must be parenthesized because another argument follows it.
    points = self.clusters[label].points
    self.clusters[label].centroid.x = numpy.fromiter((point.x for point in points), dtype=float).mean()
    self.clusters[label].centroid.y = numpy.fromiter((point.y for point in points), dtype=float).mean()

Answer 5

The extra functionality of numpy's mean may add some overhead.

>>> def myMean(itr):
...   c = t = 0
...   for item in itr:
...     c += 1
...     t += item
...   return t / c
...
>>> import timeit
>>> a = range(20)
>>> t1 = timeit.Timer("myMean(a)","from __main__ import myMean, a")
>>> t1.timeit()
6.8293311595916748
>>> t2 = timeit.Timer("average(a)","from __main__ import a; from numpy import average")
>>> t2.timeit()
69.697283029556274
>>> t3 = timeit.Timer("average(array(a))","from __main__ import a; from numpy import average, array")
>>> t3.timeit()
51.65147590637207
>>> t4 = timeit.Timer("fromiter(a,npfloat).mean()","from __main__ import a; from numpy import average, fromiter,float as npfloat")
>>> t4.timeit()
18.513712167739868

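So among the numpy variants, fromiter performs best, although for this 20-element input the plain Python loop is still the fastest of the four.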
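The timings above come from an old Python 2 session, so here is a sketch of how the comparison could be rerun on a current interpreter. The helper names (my_mean, fromiter_mean, time_means) are made up for this example, and no results are quoted because they depend heavily on the machine and on the input size.

```python
import timeit
import numpy as np

def my_mean(itr):
    """Plain Python mean, as in the session above."""
    count = total = 0
    for item in itr:
        count += 1
        total += item
    return total / count

def fromiter_mean(itr):
    """Build a float array straight from the iterable, then take its mean."""
    return np.fromiter(itr, dtype=float).mean()

def time_means(data, number=1000):
    """Return the seconds taken by `number` calls of each approach."""
    return {
        "python loop": timeit.timeit(lambda: my_mean(data), number=number),
        "np.mean(list)": timeit.timeit(lambda: np.mean(data), number=number),
        "fromiter().mean()": timeit.timeit(lambda: fromiter_mean(data), number=number),
    }

if __name__ == "__main__":
    # Try a tiny and a larger input; the ranking may differ between them.
    for size in (20, 20000):
        print(size, time_means(list(range(size))))
```

It is worth trying input sizes close to your real cluster sizes, since per-call overhead matters most for very short inputs.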

Answer 6

OK, I came up with a moving-average solution that is fast and doesn't change the data structure:

def updateCentroid(self, label):
    cluster = self.clusters[label]
    n = len(cluster.points)
    cluster.centroid.x = ((n-1)*cluster.centroid.x + cluster.points[n-1].x)/n
    cluster.centroid.y = ((n-1)*cluster.centroid.y + cluster.points[n-1].y)/n

This brought the computation time (for the whole k-means algorithm) down to 13% of the original. =)

Thanks, everyone, for the great insights!
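Note that the update above only stays correct if it runs exactly once per newly appended point, since it reads only the last element of cluster.points. A stand-alone sketch of the same running-mean idea follows; the class name RunningCentroid is hypothetical and not part of the code in this thread.

```python
class RunningCentroid:
    """Hypothetical helper: mean of 2-D points, updated one point at a time."""

    def __init__(self):
        self.n = 0      # number of points seen so far
        self.x = 0.0
        self.y = 0.0

    def add(self, px, py):
        self.n += 1
        # mean_n = mean_(n-1) + (p - mean_(n-1)) / n, which is algebraically
        # the same as ((n-1) * mean_(n-1) + p) / n used above.
        self.x += (px - self.x) / self.n
        self.y += (py - self.y) / self.n
```

Removing a point or moving it to another cluster would need a matching downdate; that case is not handled here.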

Answer 7

Try this:

def updateCentroid(self, label):
    self.clusters[label].centroid.x = numpy.array([point.x for point in self.clusters[label].points]).mean()
    self.clusters[label].centroid.y = numpy.array([point.y for point in self.clusters[label].points]).mean()

Answer 8

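A profiler will only tell you about functions. This is the method I use; it pinpoints costly lines of code, including the points where functions are called.

That said, there is a common assumption that data structures are free. As @Michael-Anderson asked, why not avoid building the arrays? That was the first thing I noticed in your code: you build the arrays by appending. You don't need to.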

Answer 9

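One approach is to add x_sum and y_sum to the "clusters" object and sum the coordinates as points are added. If points move between clusters, you can also update the sums as they move. Getting the centroid is then just a matter of dividing x_sum and y_sum by the number of points. If your points are numpy vectors that can be added, you don't even need to sum the components separately: just keep the sum of all the vectors and multiply by 1/len at the end.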
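A minimal sketch of that bookkeeping, assuming points are numpy vectors (or anything convertible to them). The class name Cluster and its methods are hypothetical and only illustrate the idea.

```python
import numpy as np

class Cluster:
    """Hypothetical cluster that keeps a running vector sum of its points."""

    def __init__(self, dim=2):
        self.total = np.zeros(dim)   # sum of all member points
        self.count = 0

    def add(self, point):
        self.total += np.asarray(point, dtype=float)
        self.count += 1

    def remove(self, point):
        # Call this when a point is reassigned to another cluster.
        self.total -= np.asarray(point, dtype=float)
        self.count -= 1

    def centroid(self):
        if self.count == 0:
            raise ValueError("empty cluster has no centroid")
        return self.total / self.count
```

With this layout the centroid costs one division no matter how many points the cluster holds; the price is that every add and remove has to go through these methods.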

Optimizing mean in Python
