Question

I've been playing with Python's collections.Counter data structure lately. The canonical use of this object is to count the occurrences of words in a text file:

from collections import Counter

with open(r'filename') as f:
    #create a list of all words in the file using a list comprehension
    words = [word for line in f for word in line.split()]

c = Counter(words)

The cool part is how you can use this structure to find out which words are the most common:

for word, count in c.most_common():
    print(word, count)

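The part I don't understand is that most_common() runs in O(n) time [Edit: this is incorrect. Per Martijn's answer, it actually runs in O(n log k)]. Obviously this means it can't be doing a comparison sort on a dict behind the scenes, since the fastest comparison sort is O(n log n).

So how does collections.Counter achieve such a fast sorting time?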

Answer 1

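It does not run in O(n) time. When you ask for all the values in the dictionary, a regular sort is used, which is an O(N log N) algorithm.

When you ask for the top K results, a heapq.nlargest() call is used instead, which is a more efficient method that takes O(N log K) time: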

def most_common(self, n=None):
    '''List the n most common elements and their counts from the most
    common to the least.  If n is None, then list all element counts.

    >>> Counter('abcdeabcdabcaba').most_common(3)
    [('a', 5), ('b', 4), ('c', 3)]

    '''
    # Emulate Bag.sortedByCount from Smalltalk
    if n is None:
        return sorted(self.iteritems(), key=_itemgetter(1), reverse=True)
    return _heapq.nlargest(n, self.iteritems(), key=_itemgetter(1))
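
To make the two code paths above concrete, here is a minimal sketch (written for Python 3, so it uses items() rather than iteritems()) that selects the top k entries both ways and compares the results. The variable names are illustrative only, and the complexity notes in the comments are the usual textbook bounds, not measurements:

```python
import heapq
from collections import Counter
from operator import itemgetter

counts = Counter('abcdeabcdabcaba')
k = 3

# Full sort: O(n log n) over all n distinct items, then slice off the top k.
by_full_sort = sorted(counts.items(), key=itemgetter(1), reverse=True)[:k]

# Heap selection: only k candidates are kept at any time, roughly O(n log k).
by_heap = heapq.nlargest(k, counts.items(), key=itemgetter(1))

print(by_heap)                  # [('a', 5), ('b', 4), ('c', 3)]
print(by_heap == by_full_sort)  # True
```

When k is much smaller than n, the heap-based path does noticeably less work; when k approaches n, the two are comparable.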

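The answer was talking about the counting being done in linear time; constructing a Counter instance is basically a loop over the input iterable: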

for elem in iterable:
    self[elem] = self_get(elem, 0) + 1
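
As a rough illustration of why counting is linear, the same loop can be written against a plain dict. This is only a sketch: count_items is a hypothetical helper, not part of collections, and the O(1) per-item cost assumes average-case dict behavior:

```python
from collections import Counter


def count_items(iterable):
    """Hypothetical helper: tally items with a plain dict in a single pass."""
    counts = {}
    for item in iterable:
        # One average-case O(1) lookup and one store per item,
        # so n items cost O(n) in total.
        counts[item] = counts.get(item, 0) + 1
    return counts


words = "the quick brown fox jumps over the lazy dog the end".split()
manual = count_items(words)

print(manual == dict(Counter(words)))  # True
```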

Answer 2

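Sorting is not the part that runs in linear time. That takes O(n log(k)), where n is the number of items in the counter and k is the number of items you request from most_common. Counting is what takes linear time. It basically does this: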

for item in iterable:
    self[item] = self.get(item, 0) + 1
