Question

What is the complexity of the most_common function provided by the collections.Counter object in Python?

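More specifically, does Counter keep some kind of sorted list while it is counting, allowing it to perform the most_common operation faster than O(n), where n is the number of (unique) items added to the counter? For your information, I am processing a large amount of text data, trying to find the n-th most common token.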

I checked the official documentation and the TimeComplexity article on the CPython wiki, but I couldn't find the answer.

Answer 1

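From the source code of collections.py, we see that if we don't specify the number of elements to return, most_common returns a sorted list of all the counts. This is an O(n log n) algorithm.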
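As a small illustration of the no-argument case, the sketch below compares most_common() with an explicit sort of the items by count. The sample text and variable names are made up for the example, and it assumes Python 3, where the method is items() rather than iteritems().

```python
from collections import Counter
from operator import itemgetter

# Hypothetical sample data, chosen only for illustration.
tokens = "the quick brown fox jumps over the lazy dog the fox".split()
counts = Counter(tokens)

# With no argument, every (token, count) pair comes back, sorted by count.
all_sorted = counts.most_common()

# The same ordering, produced by an explicit O(n log n) sort on the counts.
explicit = sorted(counts.items(), key=itemgetter(1), reverse=True)
```

Both lists contain all n unique tokens, so the cost is dominated by the sort.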

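If we use most_common to return k > 1 elements, then heapq.nlargest is used. This is an O(k) + O((n - k) log k) + O(k log k) algorithm, which is very good for a small constant k, since it is then essentially linear. The O(k) part comes from heapifying the initial k counts, the second part from the n - k calls to the heappushpop method, and the third part from sorting the final heap of k elements. Since k <= n, we can conclude that the complexity is: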

O(n log k)

If k = 1, it is easy to show that the complexity is:

O(n)
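The three phases described above can be sketched by hand with the heapq primitives. This is only an illustration of the idea, not the actual CPython code: top_k_sketch is a hypothetical helper, it breaks ties by comparing the items themselves (so they must be comparable), and the real heapq.nlargest differs in its details.

```python
import heapq
from collections import Counter
from itertools import islice

def top_k_sketch(counter, k):
    """Return the k highest-count (item, count) pairs, highest first."""
    pairs = iter(counter.items())
    # O(k): build a min-heap of (count, item) from the first k pairs.
    heap = [(count, item) for item, count in islice(pairs, k)]
    heapq.heapify(heap)
    # O((n - k) log k): push each remaining pair, then drop the smallest,
    # so the heap always holds the k largest pairs seen so far.
    for item, count in pairs:
        heapq.heappushpop(heap, (count, item))
    # O(k log k): sort the k survivors from most to least common.
    return [(item, count) for count, item in sorted(heap, reverse=True)]

# Hypothetical sample data, chosen only for illustration.
tokens = "the quick brown fox jumps over the lazy dog the fox".split()
counts = Counter(tokens)
top3 = top_k_sketch(counts, 3)
```

The counts returned match those of counts.most_common(3); the order of items with equal counts may differ, because this sketch breaks ties differently.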

Answer 2

The source shows exactly what happens:

def most_common(self, n=None):
    '''List the n most common elements and their counts from the most
    common to the least.  If n is None, then list all element counts.

    >>> Counter('abracadabra').most_common(3)
    [('a', 5), ('r', 2), ('b', 2)]

    '''
    # Emulate Bag.sortedByCount from Smalltalk
    if n is None:
        return sorted(self.iteritems(), key=_itemgetter(1), reverse=True)
    return _heapq.nlargest(n, self.iteritems(), key=_itemgetter(1))

heapq.nlargest is defined in heapq.py
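The source quoted above is from Python 2 (note iteritems). As a rough check in Python 3, where the equivalent call uses items(), the sketch below calls heapq.nlargest directly and compares it with most_common(n). The variable names are made up for the example.

```python
import heapq
from collections import Counter
from operator import itemgetter

counts = Counter('abracadabra')
n = 3

# What most_common(n) delegates to when n is given (Python 3 spelling).
via_heapq = heapq.nlargest(n, counts.items(), key=itemgetter(1))
via_method = counts.most_common(n)

# The n-th most common element is the last entry of the top-n list.
nth_item, nth_count = via_method[-1]
```

For the use case in the question, this means the n-th most common token can be read from the end of most_common(n) without sorting the whole counter.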

Python collections.Counter: most_common complexity
