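How to create a binary tree from a frequency dictionary

Time: 2019-03-07 20:49:17

Tags: python huffman-code

I'm new to coding, and I'm having a hard time creating a Huffman algorithm to encode and decode text files. I understand most of the concepts well, but I haven't found much on how to create and traverse the tree.

Here is my code so far: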

'Counts'

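3 answers:

Answer 0: (score: 1)

A modified version of: https://rosettacode.org/wiki/Huffman_coding#Python

This is a Huffman encoder/decoder for any message in 'txt'.

It encodes the txt message into a shortened binary variable for storage (you can store compressed_binary to disk). You can also decode it with decompressHuffmanCode, which recreates the original string from the compressed string in compressed_binary.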

from heapq import heappush, heappop, heapify
from collections import defaultdict
from functools import reduce

def encode(symb2freq):
    heap = [[wt, [sym, ""]] for sym, wt in symb2freq.items()]
    heapify(heap)
    while len(heap) > 1:
        lo = heappop(heap)
        hi = heappop(heap)
        for pair in lo[1:]:
            pair[1] = '0' + pair[1]
        for pair in hi[1:]:
            pair[1] = '1' + pair[1]
        heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
    return dict(sorted(heappop(heap)[1:], key=lambda p: (p, len(p[-1]))))

# recreates the original message from your huffman code table 
# uncomment print(a) to see how it works
def decompressHuffmanCode(a, bit):
    # print(a)
    return ('', a[1] + s[a[0]+bit[0]]) if (a[0]+bit[0] in s) else (a[0]+bit[0], a[1])

txt="CompresssionIsCoolWithHuffman"

# Create symbol to frequency table
symb2freq = defaultdict(int)
for ch in txt:
    symb2freq[ch] += 1
enstr=encode(symb2freq)

# Create Huffman code table from frequency table
s=dict((v,k) for k,v in dict(enstr).items())

# Create compressible binary. We add 1 to the front, and remove it when read from disk
compressed_binary = '1' + ''.join([enstr[item] for item in txt])

# Read compressible binary so we can uncompress it. We strip the first bit.
read_compressed_binary = compressed_binary[1:]

# Recreate the original message from read_compressed_binary
remainder,bytestr = reduce(decompressHuffmanCode, read_compressed_binary, ('', ''))
print(bytestr)

Result:

CompresssionIsCoolWithHuffman

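This is a quick implementation that should help. One thing that could still be handled programmatically is buffering, but I just wanted to show you a quick implementation that uses your frequency code.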
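On the buffering point: compressed_binary above is a Python string of '0' and '1' characters, so it is not yet smaller than the input. Below is a minimal sketch of packing such a string into real bytes, assuming the same leading-1 sentinel that the code comments describe (it keeps leading zeros from being lost). The names pack_bits and unpack_bits are hypothetical helpers, not part of the answer's code.

```python
def pack_bits(bits: str) -> bytes:
    """Pack a string of '0'/'1' characters into bytes (hypothetical helper).

    A sentinel '1' is placed in front so leading zeros survive the
    conversion to an integer.
    """
    value = int('1' + bits, 2)
    return value.to_bytes((value.bit_length() + 7) // 8, 'big')


def unpack_bits(data: bytes) -> str:
    """Invert pack_bits: recover the original '0'/'1' string."""
    value = int.from_bytes(data, 'big')
    # bin() yields '0b1...'; drop the '0b' prefix and the sentinel bit
    return bin(value)[3:]


packed = pack_bits('0010')
print(packed)               # b'\x12'
print(unpack_bits(packed))  # 0010
```

The bytes returned by pack_bits can be written to a file opened in 'wb' mode; the code table still has to be stored alongside them for decoding.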

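Answer 1: (score: 0)

I think Python's dictionary structure is enough to represent the tree and its nodes. You don't really need a separate class.

You want to initialize all the nodes: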

def huffman_tree(freq_dict):
    vals = freq_dict.copy()
    nodes = {}
    for n in vals.keys():
        nodes[n] = []

Here we have initialized a dictionary, nodes, to represent the nodes and leaves. Let's fill it with data, still inside the same function:

    while len(vals) > 1:
        s_vals = sorted(vals.items(), key=lambda x:x[1]) 
        a1 = s_vals[0][0]
        a2 = s_vals[1][0]
        vals[a1+a2] = vals.pop(a1) + vals.pop(a2)
        nodes[a1+a2] = [a1, a2]

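As you can see, I first sort the data from the frequency dictionary in ascending order. You already did this earlier, so you wouldn't have to do it again (although you sorted in descending order). Doing it later rather than sooner, inside the while loop like this, leaves you freer in which frequency dictionaries you pass to the program. What we do next is take the two lowest-frequency items from the sorted copy of freq_dict (vals), add their frequencies together, and store the sum back in vals under a combined key.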
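To make the merge step concrete, here is a small trace of the same loop on a made-up three-symbol frequency table. The frequencies are assumptions chosen only for illustration.

```python
vals = {'a': 5, 'b': 2, 'c': 1}    # hypothetical frequencies
nodes = {sym: [] for sym in vals}  # leaves start with no children

while len(vals) > 1:
    # Sort ascending by frequency and take the two smallest entries
    s_vals = sorted(vals.items(), key=lambda x: x[1])
    a1, a2 = s_vals[0][0], s_vals[1][0]
    # Replace them with one combined entry whose key is the concatenation
    vals[a1 + a2] = vals.pop(a1) + vals.pop(a2)
    nodes[a1 + a2] = [a1, a2]
    print(a1, '+', a2, '->', a1 + a2, vals[a1 + a2])

# Prints:
# c + b -> cb 3
# cb + a -> cba 8
print(nodes)
# {'a': [], 'b': [], 'c': [], 'cb': ['c', 'b'], 'cba': ['cb', 'a']}
```

The last key created ('cba' here) is the root, and each internal key maps to its two children.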

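Now we need to traverse the nodes we just built and construct some kind of symbol dictionary that represents the rule set for exchanging text with symbols. Still inside the same function: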

    symbols = {} # this will keep our encoding-rules
    root = a1+a2 # a1 and a2 are the last two entries merged,
                 # so their concatenation is the key of the root
    tree = label_nodes(nodes, root, symbols)

    return symbols, tree

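The line with tree = ... may look a bit magical, but that is only because we haven't written the function yet. Imagine a function that recursively walks every node from the root down to the leaves, adding a '0' or '1' to a string prefix that becomes the encoding symbol (this is why we sorted in ascending order: the most frequent symbols end up near the top and receive the shortest codes):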

def label_nodes(nodes, label, symbols, prefix = ''):        
    child = nodes[label]
    tree = {}
    if len(child) == 2:
        tree['0'] = label_nodes(nodes, child[0], symbols, prefix+'0')
        tree['1'] = label_nodes(nodes, child[1], symbols, prefix+'1')     
        return tree
    else: # leaf
        symbols[label] = prefix
        return label

This function does exactly that. Now we can use it:

def huffman_encode(string, symbols):
    return ''.join([symbols[str(e)] for e in string])

text =  '''This is a simple text, made to illustrate how
        a huff-man encoder works. A huff-man encoder works
        best when the text is of reasonable length and has
        repeating patterns in its language.'''

fd = freq_dict(text)    
symbols, tree = huffman_tree(fd)    
huffe = huffman_encode(text, symbols)
print(huffe)
  

Output: 001001011001110100011011101000110110100100111101011000110011010010000011011000110000110110010011011100011010111010000011011110110111010100101001011101100111011111001010101100001110011101111110010011101010101010101011010011100111101100101001011111100110001101010000100010001111101110111110100001110001111100110111110011111100011111111101110000001110011110110010100101111110011000110101000010001000111110111011111010000111000111110011011111001111110001110011101010101010101010010000000011101101111100110010001000011011110010000110110001100001101101110100011011101100101011110000010100011110111000101000100010010000011001000010001111011011110010110101000111010011100110100011100111010101010101010111100000100110000101010111101010001111010110011010101011101100011100100000110111010100001110101011001101100101010100011110111101110101111010001111111
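The usage snippet above calls freq_dict, which is not defined anywhere in this answer; it is presumably the frequency function from the question. A minimal stand-in, assuming it simply maps each character to its number of occurrences:

```python
from collections import Counter


def freq_dict(text):
    """Hypothetical stand-in for the asker's function:
    map each character in text to how often it occurs."""
    return dict(Counter(text))


print(freq_dict('aab'))  # {'a': 2, 'b': 1}
```

With a different definition (for example one that counts words instead of characters) the resulting codes, and therefore the encoded output, would differ.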

Decoding is a simple matter of traversing the tree:

def huffman_decode(encoded, tree, string=False):  
    decoded = []
    i = 0
    while i < len(encoded):
        sym = encoded[i]
        label = tree[sym]
        # Continue until a leaf is reached
        while not isinstance(label, str):
            i += 1
            sym = encoded[i]
            label = label[sym]        
        decoded.append(label)        
        i += 1
    if string == True:
        return ''.join([e for e in decoded])
    return decoded

print(huffman_decode(huffe, tree, string=True))
  

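Output: This is a simple text, made to illustrate how
        a huff-man encoder works. A huff-man encoder works
        best when the text is of reasonable length and has
        repeating patterns in its language.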

This answer is mostly stolen from my own GitHub: https://github.com/RoyMN/Python/tree/master/tools/text_handler

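Answer 2: (score: -2)

To create a basic Huffman tree, first sort the frequency list in ascending order. Then pair the first two elements of the sorted container, create a new Huffman tree node from them, add the new node back to the container, re-sort the container, and repeat until the container holds only one node: the complete tree. To traverse the tree, you can add a lookup method to the class that produces a path of 1s and 0s: record a 1 if the value being searched for is in the right child of the current node, and a 0 if it is in the left child.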

from collections import Counter
def get_frequencies(data):
  return sorted(Counter(data).items(), key=lambda x:x[-1])

class Node:
  def __init__(self, _a, _b):
    self.left, self._sum, self.right = _a, sum(i._sum if hasattr(i, '_sum') else i[-1] for i in [_a, _b]), _b
  def __contains__(self, _val):
    # Leaves are (symbol, count) tuples, so compare against the symbol at index 0
    if isinstance(self.left, tuple) and self.left[0] == _val or isinstance(self.right, tuple) and self.right[0] == _val:
      return True
    return any([_val in self.left, _val in self.right])
  def _lookup(self, chr, _path = []):
    if isinstance(self.left, tuple) and self.left[0] == chr:
      return _path+[0]
    if isinstance(self.right, tuple) and self.right[0] == chr:
      return _path+[1]
    _flag = 'right' if chr in self.right else 'left'
    return getattr(getattr(self, _flag), '_lookup', lambda _, p:p)(chr, _path+[1 if _flag == 'right' else 0])
  def __getitem__(self, chr):
     return self._lookup(chr)

corpus = "Huffman code is a particular type of optimal prefix code that is commonly used for lossless data compression. The process of finding and/or using such a code proceeds by means of Huffman coding, an algorithm developed by David A. Huffman while he was a Sc.D. student at MIT"
fs = get_frequencies(corpus)
while len(fs) > 1:
  a, b, *_fs = fs
  fs = sorted(_fs+[Node(a, b)], key=lambda x:x._sum if hasattr(x, '_sum') else x[-1])

_tree = fs[0]
final_result = ''.join(''.join(map(str, _tree[i])) for i in corpus)

Output:

'1001111010100000000010001101000011111101110110010110011100110111111101011101011101010010110100011110110101001100101010010111110100100110101111001111011000011110110101111010001110001101001100111010111001011000000001110011000111110111011001011001111101001000101011010111001101111111101110111000110001101100010110001001111101010011111000010111000010111001011101100101101110111011001100011101111110010101011010101011111011101110001010111001011000111011100111011000101101011101001010100011001110101110010101111011110001110111111101100001110000001100010010001100010110111111010000100101001100110111001011101010011100110001011011111011101010110110100011110101111101110110010110011101011100101011110111100110000100111111100000001001111110001110010100001011111110110000111100111101010000000001000110100001111110111011001000110001011011100110101111010000111110100110001101110111001000111101001000100011110010110010000011100011001011010111100001011110000000100111111000010101010000010011001011110011011011010111100111101010000000001000110100001111100001101000001101100110011101000110011110000111010011111110101111001110011011011010100001001101011101111101001010001011000001110101111010110101111001110101001000100101'

The final compressed result is produced by looping over the original input (corpus) and looking up each character's Huffman code via Node.__getitem__.
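The sort, pair, and re-insert procedure described at the top of this answer can also be sketched with a priority queue, which avoids re-sorting the whole container on every iteration. This is an alternative sketch, not the answer's method: it uses heapq and nested tuples instead of the Node class, and build_codes is a hypothetical name. It assumes the symbols are single characters of a string.

```python
import heapq
from collections import Counter
from itertools import count


def build_codes(data):
    """Return a {symbol: bitstring} table for the characters in data."""
    tiebreak = count()  # unique sequence number so the heap never compares subtrees
    heap = [(freq, next(tiebreak), sym) for sym, freq in Counter(data).items()]
    heapq.heapify(heap)
    if not heap:
        return {}
    if len(heap) == 1:  # a single distinct symbol still needs a one-bit code
        return {heap[0][2]: '0'}
    while len(heap) > 1:
        # Pop the two least frequent subtrees and push their combination
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))

    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):  # internal node: (left, right)
            walk(node[0], prefix + '0')
            walk(node[1], prefix + '1')
        else:                        # leaf: a single character
            codes[node] = prefix

    walk(heap[0][2], '')
    return codes


codes = build_codes('aaaabbc')
print(codes)                                   # {'c': '00', 'b': '01', 'a': '1'}
print(''.join(codes[ch] for ch in 'aaaabbc'))  # 1111010100
```

Building the code table once and then joining lookups keeps encoding linear in the input length, whereas calling a tree search for every character repeats the traversal each time.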