Question

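I have been reading a lot lately about various hierarchical clustering algorithms, such as single-linkage clustering and group average clustering. In general, these algorithms don't scale well. Naive implementations of most hierarchical clustering algorithms are O(N^3), but single-linkage clustering can be implemented in O(N^2) time.

It is also claimed that group average clustering can be implemented in O(N^2 logN) time. This is what my question is about.

I simply do not see how this is possible.

Explanation after explanation, such as: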

http://nlp.stanford.edu/IR-book/html/htmledition/time-complexity-of-hac-1.html

http://nlp.stanford.edu/IR-book/completelink.html#averagesection

https://en.wikipedia.org/wiki/UPGMA#Time_complexity

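...claims that group average hierarchical clustering can be done in O(N^2 logN) time by using priority queues. But when I read the actual explanation or pseudocode, it always appears to me to be nothing better than O(N^3).

Essentially, the algorithm is as follows: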

For an input sequence of size N:

Create a distance matrix of NxN #(this is O(N^2) time)
For each row in the distance matrix:
   Create a priority queue (binary heap) of all distances in the row

Then:

For i in 0 to N-1:
  Find the min element among all N priority queues # O(N)
  Let k = the row index of the min element

  For each element e in the kth row:
    Merge the min element with its nearest neighbor
    Update the corresponding values in the distance matrix
    Update the corresponding value in priority_queue[e]

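So it's this last step that, to me, seems to make this an O(N^3) algorithm. Assuming the priority queues are binary heaps, there is no way to "update" an arbitrary value in a priority queue without scanning the queue, which takes O(N) time. (A binary heap gives you constant-time access to the min element and O(log N) insertion/deletion, but you can't simply find an element by value in better than O(N) time.) Since we scan a priority queue for each element of the row, and do so for every row, we get O(N^3).

The priority queue is sorted by distance value, but the algorithm in question calls for deleting the element in the priority queue that corresponds to k, the row index in the distance matrix of the min element. Again, there is no way to find this element in the queue without an O(N) scan.

So, I assume I'm probably wrong, since everyone else says otherwise. Can someone explain how this algorithm is somehow not O(N^3) but, in fact, O(N^2 logN)?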

Answer 1

I think you are saying that the problem is that in order to update an entry in a heap you have to find it, and finding it takes O(N) time. What you can do to get round this is to maintain an index that gives, for each item i, its location heapPos[i] in the heap. Every time you swap two items to restore the heap invariant, you then need to modify two entries in heapPos to keep the index correct, but this only adds a constant factor to the work done in the heap.
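To make the idea concrete, here is a minimal sketch of a binary min-heap with a position index. The class name `IndexedMinHeap` and its methods are my own illustration, not taken from any of the linked explanations; it assumes items are the integers 0..n-1 (for example, the column indices of one row of the distance matrix).

```python
class IndexedMinHeap:
    """Binary min-heap over items 0..n-1 with a position index (hypothetical sketch)."""

    def __init__(self, priorities):
        self.key = list(priorities)                 # key[i] = priority of item i
        self.heap = list(range(len(self.key)))      # heap[p] = item stored in slot p
        self.heap_pos = list(range(len(self.key)))  # heap_pos[i] = slot holding item i
        for p in reversed(range(len(self.heap) // 2)):
            self._sift_down(p)                      # standard O(n) heapify

    def _swap(self, p, q):
        a, b = self.heap[p], self.heap[q]
        self.heap[p], self.heap[q] = b, a
        self.heap_pos[a], self.heap_pos[b] = q, p   # the two index entries to fix

    def _sift_up(self, p):
        while p > 0:
            parent = (p - 1) // 2
            if self.key[self.heap[p]] >= self.key[self.heap[parent]]:
                break
            self._swap(p, parent)
            p = parent

    def _sift_down(self, p):
        n = len(self.heap)
        while True:
            smallest = p
            for child in (2 * p + 1, 2 * p + 2):
                if child < n and self.key[self.heap[child]] < self.key[self.heap[smallest]]:
                    smallest = child
            if smallest == p:
                break
            self._swap(p, smallest)
            p = smallest

    def peek(self):
        """Return (item, key) of the minimum in O(1)."""
        item = self.heap[0]
        return item, self.key[item]

    def update(self, item, new_key):
        """Change the key of `item` in O(log n); no scan is needed."""
        p = self.heap_pos[item]
        self.key[item] = new_key
        self._sift_up(p)
        self._sift_down(self.heap_pos[item])

    def remove(self, item):
        """Delete `item` in O(log n) by swapping it with the last slot."""
        p = self.heap_pos[item]
        last = len(self.heap) - 1
        self._swap(p, last)
        self.heap.pop()
        self.heap_pos[item] = -1                    # mark as no longer present
        if p < last:
            moved = self.heap[p]
            self._sift_up(p)
            self._sift_down(self.heap_pos[moved])
```

With this layout, `update(item, new_key)` and `remove(item)` locate the entry through `heap_pos` instead of scanning, so the per-entry cost is the O(log n) sift rather than O(n).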

Answer 2

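If you store the positions in the heap (which adds another O(n) memory), you can update the heap without scanning, touching only the positions that changed. These updates are restricted to two paths in the heap (one removal, one update) and run in O(log n). Alternatively, you could binary-search by the old priority, which is probably also O(log n) (but slower; the lookup in the approach above is O(1)).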
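As a rough operation count (my own sketch, assuming one binary heap per row with a position index, and assuming the group-average distance from the merged cluster to each other cluster can be recomputed in O(1) from the two old distances and the cluster sizes):

```latex
\underbrace{O(n^2)}_{\text{distance matrix}}
+ \underbrace{n \cdot O(n)}_{\text{build } n \text{ heaps}}
+ \underbrace{(n-1)}_{\text{merges}} \cdot
  \Big( \underbrace{O(n)}_{\text{scan heap tops}}
      + \underbrace{O(n) \cdot O(\log n)}_{\text{one removal and one update per row}} \Big)
= O(n^2 \log n)
```

The O(n) scan per entry that would push this to O(n^3) is exactly the term the position index removes.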

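So IMHO, you can indeed implement these in O(n^2 log n). But the implementation will still use a lot of memory (O(n^2)), and anything that is O(n^2) does not scale. With O(n^2) memory, you usually run out of memory before you run out of time.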

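Implementing these data structures is quite tricky. If it is not done well, the result may end up slower than a theoretically worse approach. Take Fibonacci heaps, for example: they have nice properties on paper, but their constant costs are too high to pay off.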

Answer 3

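No, because the distance matrix is symmetric.

If the first entry in row 0 points to column 5, with a distance of 1, and that is the lowest in the system, then the first entry in row 5 must be the complementary entry pointing to column 0, with a distance of 1.

In fact, you only need a half matrix.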
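As an illustration of the half-matrix idea, here is a small sketch that stores only the strict upper triangle of a symmetric n x n matrix in a flat list. The helper name `condensed_index` is hypothetical and the row-by-row layout is just one possible choice.

```python
def condensed_index(i, j, n):
    """Map an unordered pair (i, j), i != j, to a slot in a flat list that
    holds only the strict upper triangle of an n x n symmetric matrix."""
    if i == j:
        raise ValueError("the diagonal is not stored")
    if i > j:
        i, j = j, i  # symmetry: (i, j) and (j, i) share one slot
    # Rows 0..i-1 contribute (n-1) + (n-2) + ... + (n-i) entries before row i.
    return i * n - i * (i + 1) // 2 + (j - i - 1)


n = 6
dist = [0.0] * (n * (n - 1) // 2)      # n(n-1)/2 entries instead of n^2
dist[condensed_index(0, 5, n)] = 1.0   # row 0, column 5, distance 1
mirrored = dist[condensed_index(5, 0, n)]  # row 5, column 0 reads the same slot
```

This roughly halves the memory, although the storage is still O(n^2).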

Algorithmic complexity of group average clustering
