Question

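I'm having a problem with a search algorithm over a Huffman tree: for a given probability distribution, I need the Huffman tree to be identical regardless of how the input data is permuted.

Here is a picture related to what I want:

[Image: Expectation vs reality]

Basically, I want to know whether it is possible to preserve the relative order of the items from the list in the tree. If not, why is that so?

For reference, I am using the Huffman tree to generate subgroups according to a division of probability, so that I can run the search() procedure below. Note that the data in the merge() subroutine is combined along with the weights. The codewords themselves are not as important as the tree (which should preserve the relative order).

For example, if I generate the following Huffman codes: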

probabilities = [0.30, 0.25, 0.20, 0.15, 0.10]
items = ['a','b','c','d','e']
items = zip(items, probabilities)
t = encode(items)
d,l = hi.search(t)
print(d)

using the following class:

class Node(object):
    left = None
    right = None
    weight = None
    data = None
    code = None

    def __init__(self, w,d):
        self.weight = w
        self.data = d

    def set_children(self, ln, rn):
        self.left = ln
        self.right = rn

    def __repr__(self):
        return "[%s,%s,(%s),(%s)]" %(self.data,self.code,self.left,self.right)

    def __lt__(self, a):
        # heapq only needs "<"; unlike __cmp__, this also works on Python 3.
        return self.weight < a.weight

    def merge(self, other):
        total_freq = self.weight + other.weight
        new_data = self.data + other.data
        return Node(total_freq,new_data)

    def index(self, node):
        return node.weight

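def encode(symbfreq):
    from heapq import heapify, heappush, heappop
    # symbfreq yields (symbol, weight) pairs; Node takes (weight, data).
    tree = [Node(wt, sym) for sym, wt in symbfreq]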
    heapify(tree)
    while len(tree)>1:
        lo, hi = heappop(tree), heappop(tree)
        n = lo.merge(hi)
        n.set_children(lo, hi)
        heappush(tree, n)
    tree = tree[0]

    def assign_code(node, code):
        if node is not None:
            node.code = code
        if isinstance(node, Node):
            assign_code(node.left, code+'0')
            assign_code(node.right, code+'1')

    assign_code(tree, '')
    return tree

I get:

'a'->11
'b'->01
'c'->00
'd'->101
'e'->100

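However, an assumption I've made in the search algorithm is that more probable items get pushed toward the left: that is, I need 'a' to have the '00' codeword, and this should always be the case regardless of any permutation of the 'abcde' sequence. An example output is:

codewords = {'a':'00', 'b':'01', 'c':'10', 'd':'110', 'e':'111'}

(N.b. even though the codeword for 'c' is a suffix of the codeword for 'd', this is OK.)

For completeness, here is the search algorithm: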

def search(tree):
    print(tree)
    current = tree.left
    other = tree.right
    loops = 0
    while current:
        loops+=1
        print(current)
        if current.data != 0 and current is not None and other is not None:
            previous = current
            current = current.left
            other = previous.right
        else:
            previous = other
            current = other.left
            other = other.right
    return previous, loops

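It works by searching for the "leftmost" 1 in a group of 0s and 1s, so the Huffman tree has to put more probable items on the left. For example, if I use the probabilities above and the input: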

items = [1,0,1,0,0]

then the index of the item returned by the algorithm is 2, which is not what should be returned (0 should be, as it is the leftmost).

Answer 1

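The usual practice is to use Huffman's algorithm only to generate the code lengths. A canonical process is then used to generate the codes from the lengths. The tree is discarded. Codes are assigned in order from shorter codes to longer codes and, within each code length, the symbols are sorted. This gives the codes you are expecting: a = 00, b = 01, etc. This is called a canonical Huffman code.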
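
The canonical assignment described above can be sketched in a few lines. This is only an illustration under assumptions: the function name `canonical_codes` and the `{symbol: length}` input format are made up for this example, and every length is assumed to be at least 1.

```python
def canonical_codes(lengths):
    """Assign canonical Huffman codes from a {symbol: code_length} mapping.

    Symbols are visited in (length, symbol) order. Each code is the
    previous code plus one, shifted left whenever the length grows.
    """
    codes = {}
    code = 0
    prev_len = 0
    ordered = sorted(lengths.items(), key=lambda kv: (kv[1], kv[0]))
    for sym, length in ordered:
        code <<= (length - prev_len)          # pad with zeros for longer codes
        codes[sym] = format(code, "0%db" % length)
        code += 1
        prev_len = length
    return codes

# Code lengths that Huffman's algorithm yields for the question's weights.
lengths = {'a': 2, 'b': 2, 'c': 2, 'd': 3, 'e': 3}
print(canonical_codes(lengths))
```

For these lengths the sketch assigns a = 00, b = 01, c = 10, d = 110, e = 111, which is the assignment asked for in the question.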

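The main reason for doing this is to make the transmission of the Huffman code more compact. Instead of sending the code for each symbol along with the compressed data, you only need to send the code length for each symbol. The codes can then be rebuilt on the other end for decompression.

A Huffman tree is not normally used for decoding either. With a canonical code, simple comparisons determine the length of the next code, and an index computed from the code value takes you directly to the symbol. Alternatively, a table-driven approach can avoid the search for the length.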
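
A rough sketch of tree-free decoding along those lines follows. The helper names `build_decoder` and `decode` are hypothetical, the bit stream is assumed to be a string of '0'/'1' characters, and the codes are assumed to be canonical with symbols sorted within each length.

```python
def build_decoder(lengths):
    """Return (first_code_per_length, symbols_per_length) for canonical codes."""
    by_len = {}
    for sym in sorted(lengths, key=lambda s: (lengths[s], s)):
        by_len.setdefault(lengths[sym], []).append(sym)
    first = {}
    code = 0
    prev_len = 0
    for length in sorted(by_len):
        code <<= (length - prev_len)
        first[length] = code              # numerically smallest code of this length
        code += len(by_len[length])
        prev_len = length
    return first, by_len

def decode(bits, lengths):
    """Decode a string of '0'/'1' characters without building a tree."""
    first, by_len = build_decoder(lengths)
    out = []
    code = 0
    n = 0
    for bit in bits:
        code = (code << 1) | int(bit)
        n += 1
        if n in by_len:
            # Comparison decides whether the code has length n;
            # the offset indexes straight into the symbol list.
            offset = code - first[n]
            if 0 <= offset < len(by_len[n]):
                out.append(by_len[n][offset])
                code = 0
                n = 0
    return out

lengths = {'a': 2, 'b': 2, 'c': 2, 'd': 3, 'e': 3}
print(decode("0001110111", lengths))
```

The input here is the concatenation of 00, 01, 110 and 111, so the sketch recovers a, b, d, e. A production decoder would usually read bits from a byte stream and might use lookup tables instead of the per-bit loop.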

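For your tree, arbitrary choices are made whenever frequencies are equal. In particular, in the second step the first node pulled is c with probability 0.2, and the second node pulled is b with probability 0.25. However, instead of b, it would have been equally valid to pull the node created in the first step, (e,d), whose probability is also 0.25. In fact, that is the choice you would prefer for your desired end state. Alas, you have relinquished control of that arbitrary choice to the heapq library.

(Note: since you are using floating-point values, 0.1 + 0.15 is not necessarily exactly equal to 0.25, though it turns out that it is. As a counterexample, 0.1 + 0.2 is not equal to 0.3. If you want to see what happens when sums of frequencies are equal to other frequencies or sums of frequencies, you are better off using integers for the frequencies, e.g. 6, 5, 4, 3, 2.)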
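
The floating-point point is easy to check directly. The variable names below are only for illustration:

```python
# With IEEE 754 doubles, this particular sum happens to be exact...
float_tie = (0.1 + 0.15 == 0.25)
# ...but this one is not: 0.1 + 0.2 evaluates to 0.30000000000000004.
float_mismatch = (0.1 + 0.2 == 0.3)

# Integer weights make ties exact and predictable: here 3 + 2 == 5.
weights = [6, 5, 4, 3, 2]
int_tie = (weights[3] + weights[4] == weights[1])

print(float_tie, float_mismatch, int_tie)  # True False True
```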

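Some of the mis-ordering can be solved by fixing a couple of mistakes: change lo.merge(hi) to hi.merge(lo), and reverse the order of the bits, i.e. assign_code(node.left, code+'1') followed by assign_code(node.right, code+'0'). Then at least a is assigned 00, d comes before e, and b comes before c. The ordering is then adebc.

Now that I think about it, even if you had chosen (e,d) over b, e.g. by setting the probability of b to 0.251, you still would not get the complete ordering you are after. No matter what, the probability of (e,d) (0.25) is greater than the probability of c (0.2). So even in that case, the final ordering would be (with the fixes above) abdec instead of your desired abcde. So it is not possible to get what you want, assuming a consistent tree ordering and a consistent bit assignment with respect to the probabilities of groups of symbols: for example, assuming that for each branch the stuff on the left has a probability greater than or equal to the stuff on the right, that 0 is always assigned to the left, and that 1 is always assigned to the right. You would need to do something different.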
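
Both orderings can be reproduced with a compact re-implementation of the two fixes. This is a sketch rather than the asker's code: `build_codes` is a made-up name, integer weights replace the probabilities, and a sequence counter is an added assumption that makes tie-breaking deterministic (leaves win ties against merged nodes, so b is popped before (e,d) in the first case).

```python
from heapq import heapify, heappop, heappush
from itertools import count

class Node(object):
    def __init__(self, weight, data, left=None, right=None):
        self.weight = weight
        self.data = data
        self.left = left
        self.right = right

def build_codes(symbols, weights):
    seq = count()  # tie-breaker: earlier-created nodes are popped first
    heap = [(w, next(seq), Node(w, s)) for s, w in zip(symbols, weights)]
    heapify(heap)
    while len(heap) > 1:
        lo = heappop(heap)[2]
        hi = heappop(heap)[2]
        # Fix 1: hi.merge(lo), so the heavier subtree's data comes first.
        parent = Node(lo.weight + hi.weight, hi.data + lo.data, lo, hi)
        heappush(heap, (parent.weight, next(seq), parent))
    root = heap[0][2]
    codes = {}
    def assign(node, code):
        if node.left is None:          # leaf
            codes[node.data] = code
            return
        # Fix 2: swapped bits, '1' goes left and '0' goes right.
        assign(node.left, code + '1')
        assign(node.right, code + '0')
    assign(root, '')
    return root.data, codes

print(build_codes('abcde', [30, 25, 20, 15, 10]))
print(build_codes('abcde', [300, 251, 200, 150, 100]))
```

The first call yields the ordering adebc (with a = 00), and the second, where b is nudged above the weight of (e,d), yields abdec. Neither is abcde, which matches the argument above.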

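The different thing that comes to mind is what I said at the start of this answer. Use the Huffman algorithm just to get the code lengths. Then you can assign the codes to the symbols in whatever order you like, and build a new tree. That would be much easier than trying to come up with some scheme to coerce the original tree into what you want, and then proving that it works in all cases.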

Answer 2

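I'll flesh out what Mark Adler said with working code. Everything he said is right :-) The high points:

You must not use floating-point weights, or any other scheme that loses information about the weights. Use integers. Simple and correct. If, for example, you have 3-digit floating probabilities, convert each to an integer via int(round(the_probability * 1000)), then adjust them if needed to ensure the sum is exactly 1000.
heapq heaps are not "stable": nothing is defined about which item is popped when multiple items have the same minimal weight.
So you can't get what you want while building the tree.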
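
The integer conversion from the first point might look like the sketch below. The function name `to_int_weights` is hypothetical, and pushing any rounding residue onto the largest weight is just one possible adjustment policy, not something the answer prescribes.

```python
def to_int_weights(probabilities, scale=1000):
    """Convert probabilities to integer weights that sum exactly to `scale`."""
    weights = [int(round(p * scale)) for p in probabilities]
    # Assumption: absorb any rounding residue in the (first) largest weight.
    diff = scale - sum(weights)
    weights[weights.index(max(weights))] += diff
    return weights

print(to_int_weights([0.30, 0.25, 0.20, 0.15, 0.10]))
print(to_int_weights([1.0 / 3, 1.0 / 3, 1.0 / 3]))
```

The first call needs no adjustment. In the second, each third rounds to 333, so one unit is added to the first weight to bring the total back to 1000.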

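A slight variation of "canonical Huffman codes" appears to be what you do want. Building a tree for that is a long-winded process, but each step is simple enough. The first tree built is thrown away: the only information taken from it is the lengths of the codes assigned to each symbol.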

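Running:

syms = ['a','b','c','d','e']
weights = [30, 25, 20, 15, 10]
t = encode(syms, weights)
print(t)

prints this (formatted for readability):

[abcde,,
    ([ab,0,
        ([a,00,(None),(None)]),
        ([b,01,(None),(None)])]),
    ([cde,1,
        ([c,10,(None),(None)]),
        ([de,11,
            ([d,110,(None),(None)]),
            ([e,111,(None),(None)])])])]

As best I understand, this is exactly what you want. Complain if it isn't ;-)

EDIT: there was a bug in the assignment of canonical codes, which did not show up unless the weights were very different. It is fixed now.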

class Node(object):
    def __init__(self, data=None, weight=None,
                 left=None, right=None, code=None):
        self.data = data
        self.weight = weight
        self.left = left
        self.right = right
        self.code = code

    def is_symbol(self):
        return self.left is self.right is None

    def __repr__(self):
        return "[%s,%s,(%s),(%s)]" % (self.data, self.code,
                                      self.left, self.right)

    def __lt__(self, a):
        # heapq only needs "<"; unlike __cmp__, this also works on Python 3.
        return self.weight < a.weight

def encode(syms, weights):
    from heapq import heapify, heappush, heappop

    tree = [Node(data=s, weight=w)
            for s, w in zip(syms, weights)]
    sym2node = {s.data: s for s in tree}
    heapify(tree)
    while len(tree) > 1:
        a, b = heappop(tree), heappop(tree)
        heappush(tree, Node(weight=a.weight + b.weight,
                            left=a, right=b))

    # Compute code lengths for the canonical coding.
    sym2len = {}
    def assign_codelen(node, codelen):
        if node is not None:
            if node.is_symbol():
                sym2len[node.data] = codelen
            else:
                assign_codelen(node.left, codelen + 1)
                assign_codelen(node.right, codelen + 1)
    assign_codelen(tree[0], 0)

    # Create canonical codes, but with a twist: instead
    # of ordering symbols alphabetically, order them by
    # their position in the `syms` list.
    # Construct a list of (codelen, index, symbol) triples.
    # `index` breaks ties so that symbols with the same
    # code length retain their original ordering.
    triples = [(sym2len[name], i, name)
               for i, name in enumerate(syms)]
    code = oldcodelen = 0
    for codelen, _, name in sorted(triples):
        if codelen > oldcodelen:
            code <<= (codelen - oldcodelen)
        sym2node[name].code = format(code, "0%db" % codelen)
        code += 1
        oldcodelen = codelen

    # Create a tree corresponding to the new codes.
    tree = Node(code="")
    dir2attr = {"0": "left", "1": "right"}
    for snode in sym2node.values():
        scode = snode.code
        codesofar = ""
        parent = tree
        # Walk the tree creating any needed interior nodes.
        for d in scode:
            assert parent is not None
            codesofar += d
            attr = dir2attr[d]
            child = getattr(parent, attr)
            if codesofar == scode:
                # We're at the leaf position.
                assert child is None
                setattr(parent, attr, snode)
            elif child is not None:
                assert child.code == codesofar
            else:
                child = Node(code=codesofar)
                setattr(parent, attr, child)
            parent = child

    # Finally, paste the `data` attributes together up
    # the tree. Why? Don't know ;-)
    def paste(node):
        if node is None:
            return ""
        elif node.is_symbol():
            return node.data
        else:
            result = paste(node.left) + paste(node.right)
            node.data = result
            return result
    paste(tree)
    return tree

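Duplicate symbols

Could I swap the sym2node dict to an ordereddict to deal with repeated 'a'/'b's etc?

No, for two reasons:

no mapping type supports duplicate keys; and,

the concept of "duplicate symbols" makes no sense for Huffman encoding.

So, if you're determined ;-) to pursue this, you first have to ensure that the symbols are unique. Just add this line at the start of the function: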

syms = list(enumerate(syms))

For example, if the syms passed in is:

['a', 'b', 'a']

that will be changed to:

[(0, 'a'), (1, 'b'), (2, 'a')]

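All symbols are now 2-tuples, and are obviously unique because each starts with a unique integer. The only thing the algorithm cares about is that symbols can be used as dict keys; it does not care whether they are strings, tuples, or any other hashable type that supports equality testing.

So nothing in the algorithm needs to change. But before the end, we'll want to restore the original symbols. Just insert this before the paste() function:

    def restore_syms(node):
        if node is None:
            return
        elif node.is_symbol():
            node.data = node.data[1]
        else:
            restore_syms(node.left)
            restore_syms(node.right)
    restore_syms(tree)

That simply walks the tree and strips the leading integers off the symbols' .data members. Or, perhaps simpler, just iterate over sym2node.values() and transform the .data member of each.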
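
In isolation, the uniqueness trick and its reversal look like this. The variable names are illustrative only, and the dict values are placeholders standing in for Node objects:

```python
syms = ['a', 'b', 'a']

# Tag each symbol with its position so that repeats become distinct keys.
unique_syms = list(enumerate(syms))           # [(0, 'a'), (1, 'b'), (2, 'a')]
sym2node = {s: None for s in unique_syms}     # placeholder values, no key collisions

# Strip the leading integer again to recover the original symbols.
restored = [tagged[1] for tagged in unique_syms]
print(len(sym2node), restored)
```

All three tagged symbols survive as separate dict keys, and dropping the first tuple element gives back the original list.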

Mapping a list to a Huffman tree whilst preserving relative order
