Question

我正在尝试编写一个模块，将霍夫曼编码的单词分配给输入符号，但给定的代码与它们的外观不同。

例如，如果我使用以下符号概率运行它：

（第1栏：概率;第2栏：我的霍夫曼码;第3栏：正确的霍夫曼码）

0,25 - ＆gt; 01 - ＆gt; 10

0,15 - ＆gt; 101 - ＆gt; 111

0,15 - ＆gt; 110 - ＆gt; 110

0,1 - ＆gt; 1111 - ＆gt; 010

0,1 - ＆gt; 000 - ＆gt; 001

0,05 - ＆gt; 0010 - ＆gt; 0110

0,05 - ＆gt; 0011 - ＆gt; 0001

0,05 - ＆gt; 1000 - ＆gt; 0000

0,05 - ＆gt; 1001 - ＆gt; 01111

0,05 - ＆gt; 1110 - ＆gt; 01110

我认为这个问题可能是由于生成霍夫曼代码的功能造成的，因为 strcat（）函数的行为最初对我的想法不利，所以我把它合并了使用 strcat（）。不确定这样做是不是很好。

我正在为您提供两个负责代码分配的功能， build_huffman_tree（）和 generate_huffman_tree（），希望您可以帮助我解决这个问题，并指出在哪里问题可能是。

生成guffman树：

void generate_huffman_tree(node *n, char *code){
if(n->left== NULL && n->right== NULL){
    SYMBOLS[code_counter] = n->symbol; // this 3 lines just store current code, not important
    CODES[code_counter] = strdup(code);
    code_counter += 1;
}
if(n->left!= NULL){
    char temp[100];
    strcpy(temp, code);
    strcat(temp, "0");
    generate_huffman_tree(n->left, temp);
}
if(n->right!= NULL){
    char temp[100];
    strcpy(temp, code);
    strcat(temp, "1");
    generate_huffman_tree(n->right, temp);
}

构建霍夫曼树：

node *build_huffman_tree(double *probabilities){

int num_of_nodes = NUM_SYMBOLS;
int num = NUM_SYMBOLS;

// 1) Initialization: Create new node for every probability
node *leafs = (node*) malloc(num_of_nodes*sizeof(node));
int i;
for(i=0; i<num_of_nodes; i+=1){
    node c;
    c.probability= *(probability+ i);
    c.symbol= *(SYMBOLS + i);
    c.left= NULL;
    c.right= NULL;
    *(leafs+i) = c;
}

node *root= (node*) malloc(sizeof(node)); // Root node which will be returned

while(num_of_nodes> 1){

    // 2) Find 2 nodes with lowest probabilities
    node *min_n1= (node*)malloc(sizeof(node));
    node *min_n2 = (node*)malloc(sizeof(node));

    *min_n1 = *find_min_node(leafs, num, min_n1);
    leafs = remove_node(leafs, min_n1, num); 
    num -= 1;

    *min_n2= *find_min_node(leafs, num, min_n2);
    leafs = remove_node(leafs, min_n2, num);
    num -= 1;

    // 3) Create parent node, and assign 2 min nodes as its children
            // add parent node to leafs, while its children have been removed from leafs
    node *new_node = (node*) malloc(sizeof(node));
    new_node->probabilty= min_n1->probability + min_n2->probability;
    new_node->left= min_n1;
    new_node->right= min_n2;

    leafs = add_node(leafs, new_node, num);
    num += 1;

    num_of_nodes -= 1;

    root = new_node;
}

return root;

我已经测试了用于查找2分钟节点，删除和添加节点到叶子结构的函数，并且它被证明可以正常工作，所以我想问题应该是关于此的问题。

Answer 1

我没有查看您的源代码，但您生成的Huffman代码没有任何问题。你所谓的“正确的霍夫曼代码”也没有错。使用该组概率可能存在多个有效的霍夫曼代码。如果你取两个霍夫曼码的概率乘以比特长度，你会发现这些总和完全相同。两个霍夫曼码都是最优的，即使它们不同。

这种情况发生的方式是，当您寻找两个最低频率时，有多个选择。根据您的选择，您将获得不同的树。

Answer 2

以下代码是Mark Allen Weiss算法的实现。试一试！

除了提供根据以前构成的每个字母代码显示结果的功能外，它还提供与您类似的例程。

使用的编译器是MinGW 2.95（C-Free 4.0）。

先决条件：

带有文本的输入文件（任何，但请记住，它仅处理字母字符，没有标点符号，没有空格，没有数字）。常量IN_PATH是您应该修改为指向文本文件的正确位置以便成功运行程序的常量。

图像显示示例文本，字母比例和霍夫曼代码解释的结果（字母用一个空格分隔）。

祝你好运！

//*******************************************************************
// File:            HuffmanEncoding - Tree.c
// Author(s):       Mohamed Ennahdi El Idrissi
// Date:            14-Aug-2012
// 
// Input Files:     in.txt
// Output Files:    out.txt
// Description:     CSC 2302 - <Data Structures>
//                             <Struct, Array, File I/O, Recursion, Pointers, Binary Tree>
//                  This program covers the Huffman Encoding concept.
//                  We first read a file, from we which we count the number of characters, and then reckon the frequency
//                  of each letter individually. Each letter's frequency is stored in a node with its respective character.
//                  This node is stored in an array of 26 elements (element 0 -> 'A', element 1 -> 'B'...element 25 -> 'Z').
//                  Each element is a pointer, and each pointer is supposed to be a root of a tree (sub tree).
//                  After processing all characters of the text (read from a file), we end up with an array with
//                  25 NULL elements. The only element that is not NULL is the root of the tree that gathers the different
//                  nodes of each letter.
//                  Deducing the encoding of each letter if performed with intermediary of the prefix traversal.
//                  To summarize, the pseudo-code is:
//                      - Initialize the letters array
//                      - Read file
//                          - Increment each letter frequency + compute the number of characters in the file
//                          - Store in the array's node the frequency of each letter
//                      - Compute the number (N) of involved characters (Sometimes, texts don't include all letters. In our case 'Q' and 'Z' are absent).
//                      - Loop N times
//                          - find Minimum and second minimum
//                          - create a new node, its left child contains the minimum and the right child contains the second minimum
//                          - minimum position points on the new node, and the second minimum's array position points on NULL
//                      - Browse the array till the unique non NULL element is encountered
//                          - invoke prefix traversal function
//                              - build the encoding of each character
//                              - display the letter and its characteristics when found.
//                      - Finally, read the output file to interpret its content
//                          - if root contains a character (A - Z), display character
//                          - else, if the current character is '0', browse the left leaf
//                          - else, if the current character is '1', browse the right leaf
//                          
//*******************************************************************
#include <stdio.h>

#define NBR_OF_LETTERS 26

#define LEFT 'L'
#define RIGHT 'R'

#define CODE_SIZE 128

#define TYPED_ALLOC(type) (type *) malloc( sizeof(type) )
#define BYTE_SIZE 8

#define IN_PATH "./files/in.txt"
#define OUT_PATH "./files/out.txt"

typedef struct tree_node_s {
    float frequency;
    char c;
    char code[CODE_SIZE];
    struct tree_node_s *left;
    struct tree_node_s *right;
} tree_node_t;

tree_node_t *arr[NBR_OF_LETTERS], *letters[NBR_OF_LETTERS];


void findMinAndSecondMin(tree_node_t **, float *, int *, float *, int *);
void printTree(tree_node_t *);
void interpret(char *, int *, tree_node_t *);
void printTree(tree_node_t *);
void encode(tree_node_t *, tree_node_t **, char, short, char*);

/*
 *
 */
int main() {
    char str[CODE_SIZE];
    int fileReadingVerdict;
    int i, j, k, index, n;
    float min, secondMin;
    int minIndex, secondMinIndex;
    int numberOfCharacters = 0;
    tree_node_t *tree;
    FILE *in = fopen(IN_PATH, "r");
    FILE *out;
    if ( in == NULL ) {
        printf("\nFile not found");
        return 0;
    } else {
        /*
         *  Begin: Array Initialization
         */
        for (i = 'A'; i <= 'Z'; i++) {
            index = i - 'A';
            arr[index] = NULL;
        }
        /*
         *  End:    Array Initialization
         */
        numberOfCharacters = 0;
        fileReadingVerdict = fgets(str, CODE_SIZE, in) != NULL;
        while(!feof(in) || fileReadingVerdict) {
            n = strlen(str);
            printf("\n%s", str);
            for (i = 0; i < n ; i ++ ) {
                str[i] = toupper(str[i]);
                if (str[i] >= 'A' && str[i] <= 'Z') {
                    numberOfCharacters ++;
                    index = str[i] - 'A';
                    if (arr[index] == NULL) {
                        arr[index] = TYPED_ALLOC(tree_node_t);// malloc(sizeof(tree_node_t));
                        arr[index]->c = str[i];
                        arr[index]->frequency = 1;
                        arr[index]->left = arr[index]->right = NULL;
                    } else {
                        arr[index]->frequency += 1;
                    }
                }
            }
            if (fileReadingVerdict) {
                fileReadingVerdict = fgets(str, CODE_SIZE, in) != NULL;
            }
        }
    }   
    fclose(in);

    for ( i = 0, n = 0 ; i < NBR_OF_LETTERS ; i ++ ) {
        letters[i] = arr[i];
        if (arr[i] != NULL) {
            arr[i]->frequency /= numberOfCharacters;    // Computing the frequency.
            n ++;                                       // n is the number of involved letters which is going to be consumed in the do while loop's condition
        }
    }

    j = 1;
    do {
        findMinAndSecondMin(arr, &min, &minIndex, &secondMin, &secondMinIndex);

        if (minIndex != -1 && secondMinIndex != -1 && minIndex != secondMinIndex) {
            tree_node_t *temp;
            tree = TYPED_ALLOC(tree_node_t);// malloc(sizeof(tree_node_t));
            tree->frequency = arr[minIndex]->frequency + arr[secondMinIndex]->frequency;
            tree->c = j;
            tree->left = arr[minIndex];
            temp = TYPED_ALLOC(tree_node_t);// malloc(sizeof(tree_node_t));
            temp->c = arr[secondMinIndex]->c;
            temp->frequency = arr[secondMinIndex]->frequency;
            temp->left = arr[secondMinIndex]->left;
            temp->right = arr[secondMinIndex]->right;
            tree->right = temp;

            arr[minIndex] = tree;

            arr[secondMinIndex] = NULL;
        } 
        j ++;
    } while( j < n );

    for ( i = 0 ; i < NBR_OF_LETTERS ; i ++ ) {
        if (arr[i] != NULL)  {
            char code[CODE_SIZE];
            strcpy(code, "");
            encode(tree = arr[i], letters, 0, 0, code);
            puts("\nSuccessful encoding");
            printTree(arr[i]);
            break;
        }
    }
    in = fopen(IN_PATH, "r");
    out = fopen(OUT_PATH, "w");
    fileReadingVerdict = fgets(str, CODE_SIZE, in) != NULL;
    while(!feof(in) || fileReadingVerdict) {
        n = strlen(str);
        for (i = 0; i < n ; i ++ ) {
            str[i] = toupper(str[i]);
            if (str[i] >= 'A' && str[i] <= 'Z') {
                index = str[i] - 'A';
                fputs(letters[index]->code, out);
            }
        }
        if (fileReadingVerdict) {
            fileReadingVerdict = fgets(str, CODE_SIZE, in) != NULL;
        }
    }

    fclose(in);
    fclose(out);

    printf("\nFile size (only letters) of the input file:   %d bits", numberOfCharacters * BYTE_SIZE);

    out = fopen(OUT_PATH, "r");
    fileReadingVerdict = fgets(str, CODE_SIZE, out) != NULL;
    numberOfCharacters = 0;
    while(!feof(out) || fileReadingVerdict) {
        numberOfCharacters += strlen(str);
        if (fileReadingVerdict) {
            fileReadingVerdict = fgets(str, CODE_SIZE, out) != NULL;
        }
    }
    fclose(out);    

    printf("\nFile size of the output file: %d bits", numberOfCharacters);

    printf("\nInterpreting output file:\n");
    out = fopen(OUT_PATH, "r");
    fileReadingVerdict = fgets(str, CODE_SIZE, out) != NULL;
    while(!feof(out) || fileReadingVerdict) {
        n = strlen(str);
        i = 0 ;
        while(i < n) {
            interpret(str, &i, tree);
        }
        if (fileReadingVerdict) {
            fileReadingVerdict = fgets(str, CODE_SIZE, out) != NULL;
        }
    }
    fclose(out);

    puts("\n");
    return 0;
}
/*
 *
 */
void encode(tree_node_t *node, tree_node_t **letters, char direction, short level, char* code) {
    int n;
    if ( node != NULL ) {
        if ((n = strlen(code)) < level) {
            if (direction == RIGHT) {
                strcat(code, "1");
            } else {
                if (direction == LEFT) {
                    strcat(code, "0");
                }
            }
        } else {
            if (n >= level) {
                code[n - (n - level) - 1] = 0;
                if (direction == RIGHT) {
                    strcat(code, "1");
                } else {
                    if (direction == LEFT) {
                        strcat(code, "0");
                    }
                }   
            }
        }
        if (node->c >= 'A' && node->c <= 'Z') {
            strcpy(node->code, code);
            strcpy(letters[node->c - 'A']->code, code);
        }
        encode(node->left, letters, LEFT, level + 1, code);
        encode(node->right, letters, RIGHT, level + 1, code);
    }
}

void printTree(tree_node_t *node) {
    int n;
    if ( node != NULL ) {
        if (node->c >= 'A' && node->c <= 'Z') {
            printf("\t%c - frequency: %.10f\tencoding: %s\n", node->c, node->frequency, node->code);
        }
        printTree(node->left);
        printTree(node->right);
    }
}

/*
 *  Begin:  Minimum and second minimum
 */
void findMinAndSecondMin(tree_node_t *arr[], float *min, int *minIndex, float *secondMin, int *secondMinIndex) {
    int i, k;
    k = 0;
    *minIndex = -1;
    /*
     * Skipping all the NULL elements.
     */     
    while (k < NBR_OF_LETTERS && arr[k] == NULL) k++;

    *minIndex = k;
    *min = arr[k]->frequency;

    for ( i = k ; i < NBR_OF_LETTERS; i ++ ) {
        if ( arr[i] != NULL && arr[i]->frequency < *min ) {
            *min = arr[i]->frequency;
            *minIndex = i;
        }
    }

    k = 0;
    *secondMinIndex = -1;
    /*
     * Skipping all the NULL elements.
     */         
    while ((k < NBR_OF_LETTERS && arr[k] == NULL) || (k == *minIndex && arr[k] != NULL)) k++;

    *secondMin = arr[k]->frequency;
    *secondMinIndex = k;

    if (k == *minIndex) k ++;

    for ( i = k ; i < NBR_OF_LETTERS; i ++ ) {
        if ( arr[i] != NULL && arr[i]->frequency < *secondMin && i != *minIndex ) {
            *secondMin = arr[i]->frequency;
            *secondMinIndex = i;
        }
    }
    /*
     *  End:    Minimum and second minimum
     */
}

void interpret(char *str, int *index, tree_node_t *tree) {
    int n = strlen(str);
    if (tree->c >= 'A' && tree->c <= 'Z') {
        printf("%c ", tree->c);
        return ;
    } else {
        if ( *index < n ) {
            if (str[*index] == '0') {
                (*index) ++;
                interpret(str, index, tree->left);
            } else {
                if (str[*index] == '1') {
                    (*index) ++;
                    interpret(str, index, tree->right);
                }
            }
        }
    }
}

enter image description here

霍夫曼在C编码

2 个答案: