Question

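I need to decode a Huffman code that I encoded with my program, using a file that contains the mapping between ASCII characters and Huffman bits. I already have a dictionary in the program that maps "codes" to ASCII, like this one: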

{'01110': '!', '01111': 'B', '10100': 'l', '10110': 'q', '10111': 'y'}

I created this function:

def huffmanDecode(dictionary, text):

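It takes the dictionary and the code. I tried searching the text for the dictionary keys and replacing them, using the replace method of str and the one from re, but neither decodes the message properly. For example, if the code is: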

011111011101110

it should be straightforward to decode it to:

By!

But I have not been able to do it by iterating over the code and searching for matches in the dictionary!

How can I decode the code with the dictionary, by finding its keys inside the text and replacing them with their values?

Any help is greatly appreciated.

Answer 1

With the bitarray module you get Huffman encoding/decoding for free, and probably more efficiently than with anything else:

from bitarray import bitarray

huffman_dict = {
    '!': bitarray('01110'), 'B': bitarray('01111'),
    'l': bitarray('10100'), 'q': bitarray('10110'),
    'y': bitarray('10111')
}

a = bitarray()
a.encode(huffman_dict, 'Bylly!')
print(a)

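# list() keeps this working whether decode() returns a list or an
# iterator (as far as I know this differs between bitarray versions)
dec = list(bitarray('011111011101110').decode(huffman_dict))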
print(dec)
print(''.join(dec))

# output:
# bitarray('011111011110100101001011101110')
# ['B', 'y', '!']
# By!

If you do not want to install the module, read the part below.

Here is a variant that decodes using the Huffman tree. The program works, but there may be better ways to represent a binary tree (I chose a tuple).

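This version is probably better suited when your codewords are not all the same length. Another nice thing about the binary tree is that it makes it obvious that the code is prefix-free.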
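The prefix-free property can also be checked directly on the dictionary keys. A minimal sketch, assuming the codes are strings of '0' and '1'; the helper name is_prefix_free is made up for this illustration:

```python
def is_prefix_free(codes):
    """Return True if no codeword is a proper prefix of another codeword."""
    ordered = sorted(codes)
    # After sorting, a codeword that is a prefix of some other codeword is
    # also a prefix of its immediate successor, so comparing neighbours is enough.
    return all(not b.startswith(a) for a, b in zip(ordered, ordered[1:]))


codes = {'01110': '!', '01111': 'B', '10100': 'l', '10110': 'q', '10111': 'y'}
print(is_prefix_free(codes))                  # True
print(is_prefix_free({'0': 'a', '01': 'b'}))  # False: '0' is a prefix of '01'
```

If this returns False, no decoder can split the bit string unambiguously, whichever approach you use.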

Your code in tree form looks like this (overly indented to make the tree structure visible):

huffman_tree = \
    (   # 0
        (   # 00
            None,
            # 01
            (   # 010
                None,
                # 011
                (   # 0110
                    None,
                    # 0111
                    (   # 01110
                        '!',
                        # 01111
                        'B')))),
        # 1
        (   # 10
            (   # 100
                None,
                # 101
                (   # 1010
                    (   # 10100
                        'l',
                        # 10101
                        None
                    ),
                    # 1011
                    (   # 10110
                        'q',
                        # 10111
                        'y'))),
            # 11
            None))
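The tuple above was typed by hand. As a rough sketch, the same nested tuple could be derived from the question's code-to-character dictionary. The function name build_tree is an assumption of this example, and it expects non-empty codes made of '0' and '1':

```python
def build_tree(codes):
    """Turn a {bitstring: char} dict into nested 2-tuples; missing branches are None."""
    root = [None, None]  # lists while building, because tuples are immutable
    for bits, char in codes.items():
        node = root
        for bit in bits[:-1]:
            i = int(bit)
            if node[i] is None:
                node[i] = [None, None]
            elif isinstance(node[i], str):
                raise ValueError('code is not prefix-free')
            node = node[i]
        last = int(bits[-1])
        if node[last] is not None:
            raise ValueError('code is not prefix-free')
        node[last] = char

    def freeze(node):
        # Convert the nested lists to tuples; leaves and None stay as they are.
        if isinstance(node, list):
            return tuple(freeze(child) for child in node)
        return node

    return freeze(root)


codes = {'01110': '!', '01111': 'B', '10100': 'l', '10110': 'q', '10111': 'y'}
tree = build_tree(codes)
print(tree[0][1][1][1])  # ('!', 'B')
```

For this dictionary the result compares equal to the hand-written huffman_tree.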

Using this tree, you can then decode with:

def huffman_decode(strg):
    ret = ''
    cur_node = huffman_tree
    for char in strg:
        cur_node = cur_node[int(char)]
        if cur_node is None:
            raise ValueError
        elif isinstance(cur_node, str):
            ret += cur_node
            cur_node = huffman_tree
    return ret

print(huffman_decode('011111011101110'))

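If decoding hits None, something went wrong and a ValueError is raised. As soon as decoding reaches a string, the current node cur_node is reset to the root node and the game starts again from the top of the tree.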
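One case the decoder above does not catch is input that stops in the middle of a codeword: for '0111101' it returns 'B' and silently drops the trailing '01'. A possible stricter variant, sketched with the tree passed in as a parameter (the name huffman_decode_strict and the error messages are made up here):

```python
def huffman_decode_strict(tree, bits):
    """Decode bits with a tuple tree; reject invalid and truncated input."""
    out = []
    node = tree
    for pos, bit in enumerate(bits):
        if bit not in '01':
            raise ValueError(f'not a bit at position {pos}: {bit!r}')
        node = node[int(bit)]
        if node is None:
            raise ValueError(f'no codeword matches at position {pos}')
        if isinstance(node, str):
            out.append(node)  # reached a leaf: emit it and restart at the root
            node = tree
    if node is not tree:
        raise ValueError('input ends in the middle of a codeword')
    return ''.join(out)


# The same tree as above, written compactly.
tree = ((None, (None, (None, ('!', 'B')))),
        ((None, (('l', None), ('q', 'y'))), None))

print(huffman_decode_strict(tree, '011111011101110'))  # By!
```

Collecting the characters in a list and joining once at the end is just a habit; for short messages the += of the original is equally fine.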

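And just because I can: here is a visual display of your (incomplete) Huffman tree. It may help to understand what the algorithm does: whenever you encounter a 0, go right and down; whenever you encounter a 1, go right and up. If you hit a terminal node, return the character at that node and restart at the root node.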

Answer 2

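Not sure what you tried, but re.sub or replace probably do not work because they do not necessarily replace from the beginning of the string. You have to see which code the string starts with, replace that code, and then continue with the rest of the string.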
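To make the problem concrete, here is a small demonstration with the dictionary and message from the question. The loop is only my guess at what a replace-based attempt might look like:

```python
codes = {'01110': '!', '01111': 'B', '10100': 'l', '10110': 'q', '10111': 'y'}
text = '011111011101110'  # should decode to 'By!'

naive = text
for code, char in codes.items():
    # str.replace matches anywhere in the string, not only at codeword boundaries
    naive = naive.replace(code, char)

print(naive)  # B1!1110
```

The first replacement finds '01110' at offset 6, which straddles the codewords for 'y' and '!', so the bits left over no longer line up with any code.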

For example, with a loop that always matches at the start of the remaining text:

def huffmanDecode(dictionary, text):
    res = ""
    while text:
        for k in dictionary:
            if text.startswith(k):
                res += dictionary[k]
                text = text[len(k):]
    return res

Or recursively:

def huffmanDecode(dictionary, text):
    if text:
        k = next(k for k in dictionary if text.startswith(k))
        return dictionary[k] + huffmanDecode(dictionary, text[len(k):])
    return ""

You could also build a regular expression from the codes and then use re.match to find the next one:

import re
def huffmanDecode(dictionary, text):
    p = "|".join(dictionary) # join keys to regex
    res = ""
    while text:
        m = re.match(p, text)
        res += dictionary[m.group()]
        text = text[len(m.group()):]
    return res

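Note: none of these has proper error handling; they will fail or loop forever if the codes do not match the message, but this should get you started.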
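One way the missing error handling might be added is sketched below. It uses a different technique from the three functions above: it collects bits in a buffer and looks the buffer up in the dictionary, which relies on the codes being prefix-free. The name huffman_decode_checked and the error messages are assumptions of this example:

```python
def huffman_decode_checked(dictionary, text):
    """Decode text with a prefix-free {bitstring: char} dict, raising on bad input."""
    max_len = max(map(len, dictionary))
    out = []
    buffer = ''
    for pos, bit in enumerate(text):
        buffer += bit
        if buffer in dictionary:
            out.append(dictionary[buffer])
            buffer = ''
        elif len(buffer) >= max_len:
            # No code is longer than max_len, so this buffer can never match.
            raise ValueError(f'no code matches the bits ending at position {pos}')
    if buffer:
        raise ValueError(f'leftover bits at end of input: {buffer!r}')
    return ''.join(out)


codes = {'01110': '!', '01111': 'B', '10100': 'l', '10110': 'q', '10111': 'y'}
print(huffman_decode_checked(codes, '011111011101110'))  # By!
```

Each bit is looked at once, so this runs in time linear in the message length instead of rescanning the dictionary and re-slicing the string for every codeword.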

Decoding Huffman code with a dictionary
