Question

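I'm writing a Huffman file format in which I store the code lengths of the canonical codes in the file header. During decoding, I can regenerate the canonical codes and store them in a std::map<std::uint8_t, std::vector<bool>>. The actual data is read into a single std::vector<bool>. Before anyone suggests std::bitset: Huffman codes have variable bit lengths, which is why I'm using std::vector<bool>. So, given that I have my symbols and their corresponding canonical codes, how do I decode my file? I don't know where to go from here. Can someone explain how to decode this file? I couldn't find anything relevant while searching.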

Answer 1

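You do not need to create the codes or the tree in order to decode canonical codes. All you need is the list of symbols in order and the count of symbols for each code length. By "in order", I mean sorted by code length from shortest to longest and, within each code length, sorted by symbol value.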
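Since the question stores only the code lengths in the header, the two tables described above can be built directly from those lengths. The following is a minimal sketch, not code from puff.c: the names CanonicalTable, buildTable and kMaxBits are made up for illustration, and it assumes one length per byte value, with 0 meaning "symbol not used".

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

constexpr int kMaxBits = 15;  // assumed maximum code length

// Hypothetical container for the two tables the decoder needs.
struct CanonicalTable {
    std::array<int, kMaxBits + 1> count{};  // count[len] = number of len-bit codes
    std::vector<std::uint8_t> symbol;       // symbols sorted by (length, value)
};

// lengths[sym] is the code length of byte value sym; 0 means "not used".
CanonicalTable buildTable(const std::array<std::uint8_t, 256>& lengths) {
    CanonicalTable t;
    for (int sym = 0; sym < 256; ++sym) {
        assert(lengths[sym] <= kMaxBits);   // header must not exceed the limit
        ++t.count[lengths[sym]];
    }
    t.count[0] = 0;  // unused symbols do not appear in the table

    // offset[len] = index in symbol[] of the first symbol of length len
    std::array<int, kMaxBits + 2> offset{};
    for (int len = 1; len <= kMaxBits; ++len)
        offset[len + 1] = offset[len] + t.count[len];

    t.symbol.resize(offset[kMaxBits + 1]);
    // Visiting symbols in increasing value keeps each length group sorted.
    for (int sym = 0; sym < 256; ++sym) {
        int len = lengths[sym];
        if (len != 0)
            t.symbol[offset[len]++] = static_cast<std::uint8_t>(sym);
    }
    return t;
}
```

With lengths e=1, t=2, a=3, o=3, this yields count = {0, 1, 1, 2, 0, ...} and symbol = {'e', 't', 'a', 'o'}.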

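Since the canonical codes within a code length are sequential binary integers, you can simply use integer comparisons to see whether the bits you have read so far fall within that length's code range and, if they do, an integer subtraction to determine which symbol it is.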

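Below is code from puff.c (with minor changes) that shows explicitly how this is done. bits(s, 1) returns the next bit from the stream. (This assumes that there always is a next bit.) h->count[len] is the number of symbols coded with codes of length len, where len is in 0..MAXBITS. If you add up h->count[1], h->count[2], ..., h->count[MAXBITS], you get the total number of symbols coded, which is also the length of the h->symbol[] array. The first h->count[1] symbols in h->symbol[] have length 1. The next h->count[2] symbols in h->symbol[] have length 2. And so on.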

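The values in the h->count[] array, if correct, are constrained so that they do not oversubscribe the number of codes that can be represented in len bits. They can be further constrained to represent a complete code, i.e. one in which no bit sequence is left undefined, in which case decode() cannot return an error (-1). For a code to be complete and not oversubscribed, the sum of h->count[len] << (MAXBITS - len) over all len from 1 to MAXBITS must equal 1 << MAXBITS.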

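Simple example: if we code e with one bit, t with two bits, and a and o with three bits, then h->count[] is {0, 1, 1, 2} (the first value, h->count[0], is not used) and h->symbol[] is {'e','t','a','o'}. The code for e is then 0, t is 10, a is 110, and o is 111.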
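As an aside before the puff.c routine, the completeness condition stated above can be checked with a few lines of arithmetic. This is only a sketch: codeSpaceLeft and kMaxBits are illustrative names, not part of puff.c.

```cpp
#include <array>
#include <cassert>

constexpr int kMaxBits = 15;  // assumed maximum code length

// Returns 0 for a complete code, a positive value if some bit sequences are
// left undefined (incomplete), and a negative value if oversubscribed.
long codeSpaceLeft(const std::array<int, kMaxBits + 1>& count) {
    long left = 1L << kMaxBits;  // total code space, in units of kMaxBits-bit codes
    for (int len = 1; len <= kMaxBits; ++len) {
        assert(count[len] >= 0);
        // Each len-bit code uses up 2^(kMaxBits - len) of those units.
        left -= static_cast<long>(count[len]) << (kMaxBits - len);
    }
    return left;
}
```

For the example counts {0, 1, 1, 2}, the sum is 16384 + 8192 + 2 * 4096 = 32768 = 1 << 15, so the function returns 0 and the code is complete.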

#define MAXBITS 15              /* maximum bits in a code */

struct huffman {
    short *count;       /* number of symbols of each length */
    short *symbol;      /* canonically ordered symbols */
};

int decode(struct state *s, const struct huffman *h)
{
    int len;            /* current number of bits in code */
    int code;           /* len bits being decoded */
    int first;          /* first code of length len */
    int count;          /* number of codes of length len */
    int index;          /* index of first code of length len in symbol table */

    code = first = index = 0;
    for (len = 1; len <= MAXBITS; len++) {
        code |= bits(s, 1);             /* get next bit */
        count = h->count[len];
        if (code - count < first)       /* if length len, return symbol */
            return h->symbol[index + (code - first)];
        index += count;                 /* else update for next length */
        first += count;
        first <<= 1;
        code <<= 1;
    }
    return -1;                          /* ran out of codes */
}
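For the setup in the question, where the data sits in a single std::vector<bool>, the same loop might be adapted roughly as follows. This is a sketch under assumptions, not puff.c itself: decodeSymbol, decodeAll and kMaxBits are illustrative names, and it assumes the number of encoded symbols (numSymbols) is known, for example from the header, so that padding bits at the end of the file are not decoded.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kMaxBits = 15;  // assumed maximum code length

// Decodes one symbol starting at bit position pos and advances pos.
// Returns the symbol, or -1 on an undefined code or when the bits run out.
// Assumes count[] is not oversubscribed.
int decodeSymbol(const std::vector<bool>& bits, std::size_t& pos,
                 const std::array<int, kMaxBits + 1>& count,
                 const std::vector<std::uint8_t>& symbol) {
    int code = 0;   // len bits read so far
    int first = 0;  // first code of length len
    int index = 0;  // index in symbol[] of the first code of length len
    for (int len = 1; len <= kMaxBits; ++len) {
        if (pos >= bits.size()) return -1;  // ran out of input
        code |= bits[pos++] ? 1 : 0;
        int n = count[len];
        if (code - n < first) {             // code has length len
            assert(code >= first);
            assert(static_cast<std::size_t>(index + (code - first)) < symbol.size());
            return symbol[index + (code - first)];
        }
        index += n;                         // otherwise move on to length len + 1
        first += n;
        first <<= 1;
        code <<= 1;
    }
    return -1;  // ran out of codes
}

// Decodes up to numSymbols symbols; stops early on error.
std::vector<std::uint8_t> decodeAll(const std::vector<bool>& bits,
                                    const std::array<int, kMaxBits + 1>& count,
                                    const std::vector<std::uint8_t>& symbol,
                                    std::size_t numSymbols) {
    std::vector<std::uint8_t> out;
    std::size_t pos = 0;
    while (out.size() < numSymbols) {
        int sym = decodeSymbol(bits, pos, count, symbol);
        if (sym < 0) break;
        out.push_back(static_cast<std::uint8_t>(sym));
    }
    return out;
}
```

With the e/t/a/o example, the bits 1 0 0 1 1 0 decode to t, e, a.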

Answer 2

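Your map contains the relevant information, but it maps symbols to codes. The data you are trying to decode, however, consists of codes. Because the map's lookup expects a symbol as the key, it cannot be used to efficiently find the symbol that corresponds to a code you have just read. Searching for a code and retrieving the corresponding symbol would be a linear search.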

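Instead, you should reconstruct the Huffman tree that was built for the compression step. The frequency values of the inner nodes are irrelevant here, but you need the leaf nodes at the correct positions. You can create the tree on the fly as you read your file header. Start with an empty tree. For each symbol-to-code mapping you read, create the corresponding nodes in the tree. For example, if the symbol 'D' has been mapped to the code 101, make sure there is a right child node at the root, which has a left child node, which has a right child node containing the symbol 'D', creating any nodes that are missing.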

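Using that tree, you can then decode the stream as follows (pseudocode, assuming that taking a right child corresponds to appending a 1 to the code):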

// use a node variable to remember the position in the tree while reading bits
node n = tree.root
while(stream not fully read) {
    read next bit into boolean b
    if (b == true) {
        n = n.rightChild
    } else {
        n = n.leftChild
    }
    // check whether we are in a leaf node now
    if (n.leftChild == null && n.rightChild == null) {
        // n is a leaf node, thus we have read a complete code
        // add the corresponding symbol to the decoded output
        decoded.add(n.getSymbol())
        // reset the search
        n = tree.root
    }
}
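In C++, with the std::map<std::uint8_t, std::vector<bool>> from the question, building the tree and walking it could look roughly like this. It is a sketch only: Node, buildTree and decodeWithTree are illustrative names, and bits left over at the end that do not complete a code are silently dropped. Padding bits that happen to form a valid code would still be decoded, so a real format needs a symbol count or an end marker.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <vector>

struct Node {
    std::unique_ptr<Node> child[2];  // child[0] = left (bit 0), child[1] = right (bit 1)
    int symbol = -1;                 // >= 0 only in leaf nodes
};

// Inserts every code into the tree, creating missing nodes along the path.
std::unique_ptr<Node> buildTree(
        const std::map<std::uint8_t, std::vector<bool>>& codes) {
    std::unique_ptr<Node> root(new Node());
    for (const auto& entry : codes) {
        assert(!entry.second.empty());  // every symbol needs at least one bit
        Node* n = root.get();
        for (bool bit : entry.second) {
            std::unique_ptr<Node>& next = n->child[bit ? 1 : 0];
            if (!next) next.reset(new Node());
            n = next.get();
        }
        n->symbol = entry.first;        // the end of the path is the leaf
    }
    return root;
}

// Walks the tree bit by bit, emitting a symbol whenever a leaf is reached.
std::vector<std::uint8_t> decodeWithTree(const Node& root,
                                         const std::vector<bool>& bits) {
    std::vector<std::uint8_t> out;
    const Node* n = &root;
    for (bool bit : bits) {
        n = n->child[bit ? 1 : 0].get();
        if (n == nullptr) break;        // bit sequence has no code assigned
        if (n->symbol >= 0) {
            out.push_back(static_cast<std::uint8_t>(n->symbol));
            n = &root;                  // restart at the root for the next code
        }
    }
    return out;
}
```

Unlike the pseudocode, this version stops when it follows a missing child, which can happen if the code is incomplete or the data is corrupt.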

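Note that inverting your map so that lookups go in the right direction would still give suboptimal performance compared to the binary tree traversal, because it cannot narrow the search space bit by bit the way the traversal does.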

Decoding a Huffman file from canonical form
