Efficiently keeping track of the top-k keys of a dictionary based on value

Time: 2013-03-15 06:50:20

Tags: python sorting dictionary scalability

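How do you efficiently keep track of the top-k keys of a dictionary with the largest values while the dictionary has its keys updated?

I have tried the naive approach of creating a sorted list from the dictionary after each update (as described in "Getting key with maximum value in dictionary?"), but that is very expensive and does not scale.

Real-world example:

Counting word frequency from an infinite stream of data. At any given moment the program may be asked to report whether a word is among the current top-k most frequent values. How do we accomplish this efficiently?

collections.Counter is too slow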
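For reference, the naive approach mentioned above can be sketched roughly as follows. The helper name `top_k_naive` is made up for illustration; the point is only that every call sorts all M keys, which costs O(M log M).

```python
def top_k_naive(counts, k):
    """Return the k keys with the largest values by sorting every key."""
    # Sorting touches every entry: O(M log M) for M keys, on every call.
    return sorted(counts, key=counts.get, reverse=True)[:k]


counts = {"a": 5, "b": 2, "c": 9, "d": 1}
print(top_k_naive(counts, 2))  # ['c', 'a']
```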

>>> from itertools import permutations
>>> from collections import Counter
>>> from timeit import timeit
>>> c = Counter()
>>> for x in permutations(xrange(10), 10):
    c[x] += 1


>>> timeit('c.most_common(1)', 'from __main__ import c', number=1)
0.7442058258093311
>>> sum(c.values())
3628800

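It takes nearly a second to compute this value!

I am looking for O(1) time for a most_common() function. This should be doable with another data structure that internally stores only the current top-k items and keeps track of the current minimum value.

3 Answers: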

Answer 0: (score: 2)

collections.Counter.most_common does a pass over all the values, finding the N-th largest one by putting them in a heap as it goes (in, I think, O(M log N) time, where M is the total number of dictionary elements).
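As a rough illustration of that single pass, `heapq.nlargest` can be applied to the counter's items directly. This is a sketch on a toy input; the variable names are illustrative only.

```python
import heapq
from collections import Counter
from operator import itemgetter

counts = Counter("abracadabra")
k = 2
# One pass over all M items while keeping at most k candidates in a heap.
top = heapq.nlargest(k, counts.items(), key=itemgetter(1))
print(top)
```

The cost still grows with the total number of keys M, which is why repeating this on every query does not scale.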

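heapq, as suggested by Wei Yen in the comments, might work okay: in parallel with the dictionary, maintain a heapq of the N largest values, and when you modify the dict, check whether the value is in there or should now be in there. The problem is that, as you have noted, the interface does not really offer any way to modify the "priority" (in your case, the [negative, since it is a min-heap] count) of an already-existing element.

You can modify the relevant item in place and then run heapq.heapify to restore the heap property. This takes a linear pass over the heap (of size N) to find the relevant item (unless you do additional bookkeeping to associate elements with positions; probably not worth it), and another linear pass to re-heapify. If an element was not in the list and now should be, you need to add it to the heap by replacing the smallest element (in linear time, barring some additional structure).

However, the heapq private interface includes a function _siftdown, which has this comment: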
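A minimal sketch of the modify-then-heapify idea, assuming the heap holds mutable `[-count, key]` lists so that the largest count sits at the root. The function name `update_top_n` and the entry layout are assumptions made for illustration, not part of heapq.

```python
import heapq


def update_top_n(heap, n, key, count):
    """Record key's new count in a heap of [-count, key] entries."""
    for entry in heap:  # linear search for an existing entry, O(N)
        if entry[1] == key:
            entry[0] = -count
            break
    else:
        if len(heap) < n:
            heap.append([-count, key])
        else:
            # The smallest tracked count is the largest negated value.
            i = max(range(len(heap)), key=lambda j: heap[j][0])
            if -count < heap[i][0]:
                heap[i] = [-count, key]
    heapq.heapify(heap)  # second linear pass to restore the invariant


heap = []
for word, count in [("a", 1), ("b", 2), ("c", 3), ("b", 5)]:
    update_top_n(heap, 2, word, count)
print(heap[0])  # [-5, 'b']
```

Both the search and the heapify call are linear in N, which matches the cost described above.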

# 'heap' is a heap at all indices >= startpos, except possibly for pos.  pos
# is the index of a leaf with a possibly out-of-order value.  Restore the
# heap invariant.

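Sounds good! Calling heapq._siftdown(heap, 0, pos_of_relevant_idx) will fix the heap in log N time. Of course, you first have to find the position of the entry you are incrementing, which takes linear time. You could maintain a dictionary from elements to indices to avoid that (also keeping a pointer to the position of the smallest element), but then you would either have to copy the source of _siftdown and modify it to update the dictionary when it swaps things around, or do a linear-time pass to rebuild the dictionary afterwards (but you were just trying to avoid linear passes...).

Done carefully, this should work out to O(log N) time. It turns out, though, that there is something called a Fibonacci heap that supports all the operations you need in (amortized) constant time. Unfortunately, this is one of those cases where big-O is not the whole story; the complexity of Fibonacci heaps means that in practice, except perhaps for very large heaps, they are not actually faster than binary heaps. Also (perhaps "therefore"), I could not find a standard Python implementation in a quick Google search, although the Boost C++ libraries do include one.

I would first try using heapq, doing a linear search for the element you are changing, and then calling _siftdown; this is O(N) time, compared to O(M log N) for the Counter approach. If that turns out to be too slow, you could maintain the additional dictionary of indices and write your own version of _siftdown that updates the dict, which should end up being O(log N) time. If that is still too slow (which I doubt), you could look for a Python wrapper around Boost's Fibonacci heap (or another implementation), but I really doubt it would be worth the hassle.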
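The linear search followed by the _siftdown call might look roughly like this. It is a sketch that assumes the heap holds `[-count, key]` lists and that counts only grow; `bump` is a hypothetical name, and `heapq._siftdown` is a private CPython helper that could change between versions.

```python
import heapq


def bump(heap, key, new_count):
    """Raise key's count in a heap of [-count, key] entries."""
    # Linear search for the entry's position, O(N).
    pos = next(i for i, entry in enumerate(heap) if entry[1] == key)
    heap[pos][0] = -new_count
    # Private helper: moves heap[pos] toward the root, O(log N).
    heapq._siftdown(heap, 0, pos)


heap = [[-5, "a"], [-3, "b"], [-4, "c"], [-1, "d"]]
heapq.heapify(heap)
bump(heap, "d", 9)
print(heap[0])  # [-9, 'd']
```

Because the stored value is negated, a larger count makes the entry smaller in heap order, so it can only need to move toward the root, which is the direction _siftdown handles.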
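The variant with an index dictionary could be sketched as below. This is an illustration under stated assumptions, not a drop-in replacement for heapq: the class name `IndexedTopK` and its methods are hypothetical, counts are assumed never to decrease, and the heap is ordered by plain count (not negated) so that the smallest tracked count sits at index 0 and no separate pointer to it is needed.

```python
class IndexedTopK:
    """Track the k keys with the largest counts (hypothetical helper)."""

    def __init__(self, k):
        self.k = k
        self.heap = []  # [count, key] entries; heap[0] has the smallest count
        self.pos = {}   # key -> index of its entry in self.heap

    def _swap(self, i, j):
        h = self.heap
        h[i], h[j] = h[j], h[i]
        self.pos[h[i][1]] = i  # keep the index dictionary in sync
        self.pos[h[j][1]] = j

    def _sift_toward_leaves(self, i):
        # Move an entry whose count grew until no child is smaller than it.
        n = len(self.heap)
        while True:
            small = i
            for child in (2 * i + 1, 2 * i + 2):
                if child < n and self.heap[child][0] < self.heap[small][0]:
                    small = child
            if small == i:
                return
            self._swap(i, small)
            i = small

    def _sift_toward_root(self, i):
        while i > 0:
            parent = (i - 1) // 2
            if self.heap[i][0] >= self.heap[parent][0]:
                return
            self._swap(i, parent)
            i = parent

    def update(self, key, count):
        """Record key's new count in O(log k)."""
        if key in self.pos:
            i = self.pos[key]
            self.heap[i][0] = count
            self._sift_toward_leaves(i)
        elif len(self.heap) < self.k:
            self.heap.append([count, key])
            self.pos[key] = len(self.heap) - 1
            self._sift_toward_root(len(self.heap) - 1)
        elif count > self.heap[0][0]:
            del self.pos[self.heap[0][1]]  # evict the current minimum
            self.heap[0] = [count, key]
            self.pos[key] = 0
            self._sift_toward_leaves(0)

    def __contains__(self, key):
        return key in self.pos


tracker = IndexedTopK(2)
for word, count in [("a", 1), ("b", 2), ("c", 3), ("b", 4), ("a", 5)]:
    tracker.update(word, count)
print("a" in tracker, "c" in tracker)  # True False
```

Membership checks are dictionary lookups, and each update does at most one root-to-leaf or leaf-to-root walk.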

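Answer 1: (score: 1)

Use collections.Counter; it already does that for that real-world example. Do you have other use cases?

Answer 2: (score: 0)

We can implement a class that keeps track of the top-k values, as I do not believe the standard library has this built in. It would be kept up to date in parallel with the main dictionary object (probably a Counter). You could also use it as an attribute of a subclass of the main dictionary object.

Implementation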

from collections import defaultdict


class MostCommon(object):
    """Keep track of the top-k key-value pairs.

    Attributes:
        top: Integer representing the top-k items to keep track of.
        store: Dictionary of the top-k items.
        min: The current minimum of any top-k item.
        min_set: Dictionary where keys are counts, and values are the set of
            keys with that count.
    """
    def __init__(self, top):
        """Create a new MostCommon object to track key-value pairs.

        Args:
            top: Integer representing the top-k values to keep track of.
        """
        self.top = top
        self.store = dict()
        self.min = None
        self.min_set = defaultdict(set)

    def _update_existing(self, key, value):
        """Update an item that is already one of the top-k values."""
        # Currently only handles values that are strictly increasing.
        old = self.store[key]
        assert value > old
        self.min_set[old].remove(key)
        if not self.min_set[old]:  # No keys left with the old count.
            del self.min_set[old]
        self.min_set[value].add(key)
        self.store[key] = value
        if old == self.min:  # The previous minimum may have changed.
            self.min = min(self.min_set)

    def __contains__(self, key):
        """Boolean if the key is one of the top-k items."""
        return key in self.store

    def __setitem__(self, key, value):
        """Assign a value to a key.

        The item won't be stored if it is less than the minimum (and
        the store is already full). If the item is already in the store,
        the value will be updated along with the `min` if necessary.
        """
        # Store it if we aren't full yet.
        if len(self.store) < self.top:
            if key in self.store:  # We already have this item.
                self._update_existing(key, value)
            else:  # Brand new item.
                self.store[key] = value
                self.min_set[value].add(key)
                if self.min is None or value < self.min:
                    self.min = value
        else:  # We're full. The value must exceed the minimum to be added.
            if value > self.min:  # New item must be larger than current min.
                if key in self.store:  # We already have this item.
                    self._update_existing(key, value)
                else:  # Brand new item.
                    # Make room by removing one of the current minimums.
                    old = self.min_set[self.min].pop()
                    del self.store[old]
                    # Delete the set if there are no old minimums left.
                    if not self.min_set[self.min]:
                        del self.min_set[self.min]
                    # Add the new item.
                    self.min_set[value].add(key)
                    self.store[key] = value
                    self.min = min(self.min_set.keys())

    def __repr__(self):
        if len(self.store) < 10:
            store = repr(self.store)
        else:
            length = len(self.store)
            largest = max(self.store.values())
            store = '<len={length}, max={largest}>'.format(length=length,
                                                           largest=largest)
        return ('{self.__class__.__name__}(top={self.top}, min={self.min}, '
                'store={store})'.format(self=self, store=store))

Usage example

>>> common = MostCommon(2)
>>> common
MostCommon(top=2, min=None, store={})
>>> common['a'] = 1
>>> common
MostCommon(top=2, min=1, store={'a': 1})
>>> 'a' in common
True
>>> common['b'] = 2
>>> common
MostCommon(top=2, min=1, store={'a': 1, 'b': 2})
>>> common['c'] = 3
>>> common
MostCommon(top=2, min=2, store={'c': 3, 'b': 2})
>>> 'a' in common
False
>>> common['b'] = 4
>>> common
MostCommon(top=2, min=3, store={'c': 3, 'b': 4})

Access after the values have been updated is indeed O(1)

>>> counter = Counter()
>>> for x in permutations(xrange(10), 10):
        counter[x] += 1

>>> common = MostCommon(1)
>>> for key, value in counter.iteritems():
    common[key] = value

>>> common
MostCommon(top=1, min=1, store={(9, 7, 8, 0, 2, 6, 5, 4, 3, 1): 1})
>>> timeit('repr(common)', 'from __main__ import common', number=1)
1.3251570635475218e-05

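Access is O(1), but when the minimum changes during a set-item call, that call is an O(n) operation, where n is the number of top-k values being tracked. This is still better than Counter, which is O(n) on every access, where n is the size of the entire vocabulary!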