Question

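I don't know how to solve the following problem efficiently without using _siftup or _siftdown:

How do you restore the heap invariant when one element is out of order?

In other words, update old_value in heap to new_value and keep heap working. You can assume there is only one old_value in the heap. The function definition looks like this: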

def update_value_in_heap(heap, old_value, new_value):

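Here is my real scenario; read it if you are interested.

You can imagine it as a small autocomplete system. I need to count word frequencies and maintain the k words with the highest counts, ready to be output at any moment. So I use a heap here. When a word's count is incremented, I need to update it if it is in the heap.

All the words and counts are stored in the leaves of a trie, and the heaps are stored in the trie's internal nodes. If you are worried about words that are not in a heap, don't be: I can get them from the trie's leaf nodes.

When the user types a word, it is first read from the heap and then updated. For better performance, we could reduce the update frequency by updating in batches.

So how do I update the heap when one particular word's count increases?

Here is a simple example using _siftup or _siftdown (not my scenario):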

>>> from heapq import _siftup, _siftdown, heapify, heappop

>>> data = [10, 5, 18, 2, 37, 3, 8, 7, 19, 1]
>>> heapify(data)
>>> old, new = 8, 22              # increase the 8 to 22
>>> i = data.index(old)
>>> data[i] = new
>>> _siftup(data, i)
>>> [heappop(data) for i in range(len(data))]
[1, 2, 3, 5, 7, 10, 18, 19, 22, 37]

>>> data = [10, 5, 18, 2, 37, 3, 8, 7, 19, 1]
>>> heapify(data)
>>> old, new = 8, 4              # decrease the 8 to 4
>>> i = data.index(old)
>>> data[i] = new
>>> _siftdown(data, 0, i)
>>> [heappop(data) for i in range(len(data))]
[1, 2, 3, 4, 5, 7, 10, 18, 19, 37]

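It costs O(n) to find the index and O(log n) to update. heapify is another solution, but it is less efficient than _siftup or _siftdown.

But _siftup and _siftdown are protected members of heapq, so accessing them from outside is not recommended.

So is there a better and more efficient way to solve this problem? What is the best practice for this situation?

Thanks for reading, I really appreciate any help. :)

I have already looked at "heapq python - how to modify values for which heap is sorted", but it does not answer my question.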

Answer 1

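One important thing to keep in mind is that theoretical complexity and performance are two different things (even though they are related). In other words, implementation matters too. Asymptotic complexity gives you bounds that you can treat as guarantees. For example, an O(n) algorithm guarantees that, in the worst case, the number of instructions executed is at most linear in the input size. There are two important things here: 1) constants are ignored (and constants matter in real life), and 2) the worst case depends on the algorithm you consider, not only on the input. Depending on the domain, observation 1) can become very important. In some domains the constants hidden in the asymptotic complexity are so large that you cannot even build an input larger than the constant. That is not the case here, but it is something you should always keep in mind.

Given these two observations, you can't really say "implementation B is faster than A because A is derived from an O(n) algorithm and B from an O(log n) algorithm". Even though that is a good argument in general, it is not always sufficient.

If you know what your use case will be, you can test performance directly. Using both tests and asymptotic complexity gives you a good picture of how your algorithm will perform (both in extreme cases and in realistic ones).

That being said, let's run some performance tests on the following class, which implements three different strategies (there are actually four strategies here, but Invalidate and Reinsert doesn't seem right in your case, because you would invalidate each item as many times as you see a given word). I'll include most of my code so you can double-check that I haven't messed up (you can even check the complete notebook):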

from heapq import _siftup, _siftdown, heapify, heappop

class Heap(list):
  def __init__(self, values, sort=False, heap=False):
    super().__init__(values)
    heapify(self)
    self._broken = False
    self.sort = sort
    self.heap = heap or not sort

  # Solution 1) repair using the knowledge we have after every update:        
  def update(self, key, value):
    old, self[key] = self[key], value
    if value > old:
        _siftup(self, key)
    else:
        _siftdown(self, 0, key)

  # Solution 2 and 3) repair using sort/heapify in a lazy way:
  def __setitem__(self, key, value):
    super().__setitem__(key, value)
    self._broken = True

  def __getitem__(self, key):
    if self._broken:
        self._repair()
        self._broken = False
    return super().__getitem__(key)

  def _repair(self):  
    if self.sort:
        self.sort()
    elif self.heap:
        heapify(self)

  # … you'll also need to delegate all other heap functions, for example:
  def pop(self):
    self._repair()
    return heappop(self)

Let's first check that all three approaches work:

data = [10, 5, 18, 2, 37, 3, 8, 7, 19, 1]

heap = Heap(data[:])
heap.update(8, 22)
heap.update(7, 4)
print(heap.nlargest(len(data)))

heap = Heap(data[:], sort_fix=True)
heap[8] = 22
heap[7] = 4
print(heap.nlargest(len(data)))

heap = Heap(data[:], heap_fix=True)
heap[8] = 22
heap[7] = 4
print(heap.nlargest(len(data)))

Then we can run some performance tests using the following functions:

import time
import random

def rand_update(heap, lazzy_fix=False, **kwargs):
    index = random.randint(0, len(heap)-1)
    new_value = random.randint(max_int+1, max_int*2)
    if lazzy_fix:
        heap[index] = new_value
    else:
        heap.update(index, new_value)

def rand_updates(n, heap, lazzy_fix=False, **kwargs):
    for _ in range(n):
        rand_update(heap, lazzy_fix)

def run_perf_test(n, data, **kwargs):
    test_heap = Heap(data[:], **kwargs)
    t0 = time.time()
    rand_updates(n, test_heap, **kwargs)
    test_heap[0]
    return (time.time() - t0)*1e3

results = []
max_int = 500
nb_updates = 1

for i in range(3, 7):
    test_size = 10**i
    test_data = [random.randint(0, max_int) for _ in range(test_size)]

    perf = run_perf_test(nb_updates, test_data)
    results.append((test_size, "update", perf))

    perf = run_perf_test(nb_updates, test_data, lazzy_fix=True, heap_fix=True)
    results.append((test_size, "heapify", perf))

    perf = run_perf_test(nb_updates, test_data, lazzy_fix=True, sort_fix=True)
    results.append((test_size, "sort", perf))

The results are as follows:

import pandas as pd
import seaborn as sns

dtf = pd.DataFrame(results, columns=["heap size", "method", "duration (ms)"])
print(dtf)

sns.lineplot(
    data=dtf, 
    x="heap size", 
    y="duration (ms)", 
    hue="method",
)

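From these tests we can see that heapify seems like the most reasonable choice: it has a decent worst-case complexity, O(n), and it performs better in practice. On the other hand, it is probably a good idea to investigate other options, such as a data structure dedicated to this particular problem (for example, using bins to drop words into and then moving them from one bin to the next looks like a possible track to investigate).

Important remark: this scenario (an update-to-read ratio of 1:1) is unfavorable to both the heapify and sort solutions. So if you manage to get a k:1 ratio, this conclusion will be even clearer (you can replace nb_updates = 1 with nb_updates = k in the code above).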
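As a rough illustration of the k:1 idea, the sketch below writes a whole batch of updates and repairs the heap only once. The helper name apply_batch and the (index, new_value) pair format are assumptions made for this example, not part of the code above. Whether this beats k individual sifts depends on k and on the heap size, so it should be measured.

```python
from heapq import heapify, heappop


def apply_batch(heap, updates):
    """Write every (index, new_value) pair, then repair the heap once."""
    for index, new_value in updates:
        heap[index] = new_value   # the invariant may be broken here
    heapify(heap)                 # single O(n) repair for the whole batch


data = [10, 5, 18, 2, 37, 3, 8, 7, 19, 1]
heapify(data)
apply_batch(data, [(0, 40), (3, 0), (5, 25)])

smallest = heappop(data)
print(smallest)  # 0, the only value below 1 after the batch
```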

Dataframe details:

    heap size   method  duration in ms
0        1000   update        0.435114
1        1000  heapify        0.073195
2        1000     sort        0.101089
3       10000   update        1.668930
4       10000  heapify        0.480175
5       10000     sort        1.151085
6      100000   update       13.194084
7      100000  heapify        4.875898
8      100000     sort       11.922121
9     1000000   update      153.587103
10    1000000  heapify       51.237106
11    1000000     sort      145.306110

Answer 2

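@cglacet's answer is completely wrong, but it looks very legit. The code snippet he provided is completely broken! It is also very hard to read. _siftup() is called n // 2 times in heapify(), so heapify() cannot be faster than _siftup() by itself.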
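To make the n // 2 claim concrete, here is a minimal sketch modeled on the pure-Python fallback of heapify() in the standard library's heapq module. The names counting_heapify and sift_calls are made up for this illustration, and the C implementation that CPython normally uses may organize the loop differently.

```python
from heapq import _siftup


def counting_heapify(x):
    """Heapify x in place with one _siftup per internal node; return the call count."""
    sift_calls = 0
    n = len(x)
    # Only the first n // 2 positions have children, so only they need sifting.
    for i in reversed(range(n // 2)):
        _siftup(x, i)
        sift_calls += 1
    return sift_calls


data = [10, 5, 18, 2, 37, 3, 8, 7, 19, 1]
calls = counting_heapify(data)
print(calls)  # 5, i.e. len(data) // 2
```

A single update needs one sift, while a full heapify needs one sift per internal node, which is the point of the argument above.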

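To answer the original question: there is no better way. If you are concerned about the methods being private, write your own that do the same thing.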
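One way to do that, as a minimal sketch for a min-heap stored in a plain list: the helper names sift_toward_root and sift_toward_leaves are hypothetical and are named after the direction the item moves (heapq's own private names read the other way round). The lookup still uses list.index, so it stays O(n) as in the question.

```python
from heapq import heapify, heappop


def sift_toward_root(heap, pos):
    """Move heap[pos] up while it is smaller than its parent."""
    item = heap[pos]
    while pos > 0:
        parent = (pos - 1) // 2
        if heap[parent] <= item:
            break
        heap[pos] = heap[parent]
        pos = parent
    heap[pos] = item


def sift_toward_leaves(heap, pos):
    """Move heap[pos] down while it is larger than its smallest child."""
    n = len(heap)
    item = heap[pos]
    while True:
        child = 2 * pos + 1
        if child >= n:
            break
        right = child + 1
        if right < n and heap[right] < heap[child]:
            child = right
        if item <= heap[child]:
            break
        heap[pos] = heap[child]
        pos = child
    heap[pos] = item


def update_value_in_heap(heap, old_value, new_value):
    """Replace the single old_value with new_value and restore the invariant."""
    i = heap.index(old_value)         # O(n) search, as in the question
    heap[i] = new_value
    if new_value > old_value:
        sift_toward_leaves(heap, i)   # larger values sink in a min-heap
    else:
        sift_toward_root(heap, i)     # smaller values rise


data = [10, 5, 18, 2, 37, 3, 8, 7, 19, 1]
heapify(data)
update_value_in_heap(data, 8, 22)
update_value_in_heap(data, 5, 4)
result = [heappop(data) for _ in range(len(data))]
print(result)  # [1, 2, 3, 4, 7, 10, 18, 19, 22, 37]
```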

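The only thing I agree with is that, if you don't need to read from the heap for a long time, it might be beneficial to lazily heapify() it once you need it. The question is whether you should use a heap for that at all.

Let's go over the problems with his snippet:

The heapify() function gets called multiple times during the "update" run. The chain of errors that leads to this is as follows: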

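He passes heap_fix but expects heap, and the same is true for sort
If self.sort is always False, then self.heap is always True
He redefines __getitem__() and __setitem__(), which are called every time _siftup() or _siftdown() assigns or reads something (note: these two are not called in C, so they use __getitem__() and __setitem__())
If self.heap is True and __getitem__() and __setitem__() are being called, the _repair() function is called every time _siftup() or _siftdown() swaps elements. But the call to heapify() is done in C, so __getitem__() does not get called and it does not end up in an infinite loop
He redefines self.sort, so calling it the way he tries to would fail
He reads once but updates an item nb_updates times, not 1:1 as he claims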
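Two links of that chain can be reproduced in isolation. The sketch below uses two hypothetical classes, CountingList and Shadowed, that are not part of either answer. It assumes CPython, where _siftup is the pure-Python helper from heapq.py. The read count observed during heapify() depends on whether the C accelerator is in use, so it is printed but not relied upon.

```python
from heapq import _siftup, heapify


class CountingList(list):
    """List subclass that counts item reads made through __getitem__."""

    def __init__(self, values):
        super().__init__(values)
        self.reads = 0

    def __getitem__(self, key):
        self.reads += 1
        return super().__getitem__(key)


values = CountingList([10, 5, 18, 2, 37, 3, 8, 7, 19, 1])
heapify(values)
reads_during_heapify = values.reads   # implementation-specific

values.reads = 0
values[0] = 50                        # break the invariant at the root
_siftup(values, 0)                    # pure-Python helper: reads hit __getitem__
reads_during_siftup = values.reads


class Shadowed(list):
    """Storing a flag under the name 'sort' hides the inherited list.sort."""

    def __init__(self, values, sort=False):
        super().__init__(values)
        self.sort = sort


try:
    Shadowed([3, 1, 2], sort=True).sort()
    sort_error = None
except TypeError as exc:
    sort_error = exc                  # a bool is not callable

print(reads_during_heapify, reads_during_siftup, sort_error)
```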

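I fixed the example and tried to verify it as best I can, but we all make mistakes. Feel free to check it yourself.

Results

As you can see, the "update" method using _siftup() and _siftdown() is asymptotically faster.

You should know what your code does and how long it should take to run. If in doubt, check. @cglacet checked how long the execution takes, but he did not question how long it should take. If he had, he would have found that the two don't match. And others fell for it.

Code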

import time
import random

from heapq import _siftup, _siftdown, heapify, heappop

class UpdateHeap(list):
    def __init__(self, values):
        super().__init__(values)
        heapify(self)

    def update(self, index, value):
        old, self[index] = self[index], value
        if value > old:
            _siftup(self, index)
        else:
            _siftdown(self, 0, index)

    def pop(self):
        return heappop(self)

class SlowHeap(list):
    def __init__(self, values):
        super().__init__(values)
        heapify(self)
        self._broken = False
        
    # Solution 2 and 3) repair using sort/heapify in a lazy way:
    def update(self, index, value):
        super().__setitem__(index, value)
        self._broken = True
    
    def __getitem__(self, index):
        if self._broken:
            self._repair()
            self._broken = False
        return super().__getitem__(index)

    def _repair(self):
        ...

    def pop(self):
        if self._broken:
            self._repair()
            self._broken = False  # the heap is valid again after the repair
        return heappop(self)

class HeapifyHeap(SlowHeap):

    def _repair(self):
        heapify(self)


class SortHeap(SlowHeap):

    def _repair(self):
        self.sort()

def rand_update(heap):
    index = random.randint(0, len(heap)-1)
    new_value = random.randint(max_int+1, max_int*2)
    heap.update(index, new_value)
    
def rand_updates(update_count, heap):
    for i in range(update_count):
        rand_update(heap)
        heap[0]
        
def verify(heap):
    last = None
    while heap:
        item = heap.pop()
        if last is not None and item < last:
            raise RuntimeError(f"{item} was smaller than last {last}")
        last = item

def run_perf_test(update_count, data, heap_class):
    test_heap = heap_class(data)
    t0 = time.time()
    rand_updates(update_count, test_heap)
    perf = (time.time() - t0)*1e3
    verify(test_heap)
    return perf


results = []
max_int = 500
update_count = 100

for i in range(2, 7):
    test_size = 10**i
    test_data = [random.randint(0, max_int) for _ in range(test_size)]

    perf = run_perf_test(update_count, test_data, UpdateHeap)
    results.append((test_size, "update", perf))
    
    perf = run_perf_test(update_count, test_data, HeapifyHeap)
    results.append((test_size, "heapify", perf))

    perf = run_perf_test(update_count, test_data, SortHeap)
    results.append((test_size, "sort", perf))

import pandas as pd
import seaborn as sns

dtf = pd.DataFrame(results, columns=["heap size", "method", "duration (ms)"])
print(dtf)

sns.lineplot(
    data=dtf, 
    x="heap size", 
    y="duration (ms)", 
    hue="method",
)
