Question

I am building an LZW encoder that uses a dictionary with hashing, so that it can quickly look up working words already stored in the dictionary.

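The algorithm gives correct results on small files (roughly a few hundred symbols), but it crashes on larger ones. It behaves worst on files with few distinct symbols, for example a file containing only one symbol, say 'y'. By worst I mean that it crashes when the dictionary is not even close to full. When a large input file contains many different symbols, the dictionary gets close to full, around 90%, but then it crashes as well.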

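Given the structure of my algorithm, I am not sure what causes it to crash at all, or to crash so quickly when given a large file with only one symbol. It must be something about the hashing (this is my first time doing it, so it may well contain mistakes).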

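The hash function I am using can be found here, and as far as I have tested it, it gives good results: oat_hash

The LZW encoding algorithm is based on this link, with a slight change: it runs only while the dictionary is not full: LZW encoder

Let's get to the code:

Note: oat_hash has been changed so that it returns value % CAPACITY, so every index is within the bounds of DICTIONARY.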
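For reference, here is a minimal sketch of what the modified hash could look like. It assumes the widely published one-at-a-time hash by Bob Jenkins with a final `% CAPACITY` step; the exact code used here is not shown, so the signature and return type are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

#define CAPACITY 100000

/* One-at-a-time hash reduced to a table index.
   Assumption: mirrors the "% CAPACITY" change described in the note. */
unsigned oat_hash(const void *key, size_t len)
{
    const unsigned char *p = key;
    uint32_t h = 0;

    for (size_t i = 0; i < len; i++) {
        h += p[i];
        h += h << 10;
        h ^= h >> 6;
    }
    h += h << 3;
    h ^= h >> 11;
    h += h << 15;

    return h % CAPACITY;   /* always in [0, CAPACITY) */
}
```

Because the hash is computed in an unsigned type, the remainder is never negative, so the result is a valid index into both DICTIONARY and CODES.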

// Globals
#define CAPACITY 100000
char *DICTIONARY[CAPACITY];
unsigned short CODES[CAPACITY]; // CODES and DICTIONARY are linked via index: word from dictionary on index i, has its code in CODES on index i
int position = 0;
int code_counter = 0;

void encode(FILE *input, FILE *output){

int succ1 = fseek(input, 0, SEEK_SET);
if(succ1 != 0) printf("Error: file not open!");

int succ2 = fseek(output, 0, SEEK_SET);
if(succ2 != 0) printf("Error: file not open!");

//1. Working word = next symbol from the input
char *working_word = malloc(2048*sizeof(char));
char new_symbol = getc(input);
working_word[0] = new_symbol;
working_word[1] = '\0';



//2. WHILE(there are more symbols on the input) DO
//3. NewSymbol = next symbol from the input
while((new_symbol = getc(input)) != EOF){

    char *workingWord_and_newSymbol= NULL;
    char newSymbol[2];
    newSymbol[0] = new_symbol;
    newSymbol[1] = '\0';

    workingWord_and_newSymbol = working_word_and_new_symbol(working_word, newSymbol);

    int index = oat_hash(workingWord_and_newSymbol, strlen(workingWord_and_newSymbol));

    //4. IF(WorkingWord + NewSymbol) is already in the dictionary THEN
    if(DICTIONARY[index] != NULL){
        // 5. WorkingWord += NewSymbol
        working_word = working_word_and_new_symbol(working_word, newSymbol);

    }
    //6. ELSE
    else{
        //7. OUTPUT: code for WorkingWord
        int idx = oat_hash(working_word, strlen(working_word));

        fprintf(output, "%u", CODES[idx]);

        //8. Add (WorkingWord + NewSymbol) into a dictionary and assign it a new code
        if(!dictionary_full()){
            DICTIONARY[index] = workingWord_and_newSymbol;
            CODES[index] = code_counter + 1;
            code_counter += 1;
            working_word = strdup(newSymbol);
        }else break;

    }
    //10. END IF
}
//11. END WHILE

//12. OUTPUT: code for WorkingWord
int index = oat_hash(working_word, strlen(working_word));
fprintf(output, "%u", CODES[index]);

free(working_word);

}

Answer 1

 int index = oat_hash(workingWord_and_newSymbol, strlen(workingWord_and_newSymbol));

and later:

    int idx = oat_hash(working_word, strlen(working_word));

    fprintf(output, "%u", CODES[idx]);

    //8. Add (WorkingWord + NewSymbol) into a dictionary and assign it a new code
    if(!dictionary_full()){
        DICTIONARY[index] = workingWord_and_newSymbol;
        CODES[index] = code_counter + 1;
        code_counter += 1;
        working_word = strdup(newSymbol);
    }else break;

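idx and index are unbounded, yet you use them to access bounded arrays, so you are accessing memory out of range. Here is a suggestion, though it may skew the distribution; if your hash range is much larger than CAPACITY, that should not be a problem. You also have another problem, collisions, which you need to handle. But that is a different problem.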

int index = oat_hash(workingWord_and_newSymbol, strlen(workingWord_and_newSymbol)) % CAPACITY;
// and
int idx = oat_hash(working_word, strlen(working_word)) % CAPACITY;
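On the collision point: the test `DICTIONARY[index] != NULL` treats any occupied slot as a match, even when a different word hashed to the same index. One common way to handle this is open addressing with linear probing, comparing the stored word itself. The sketch below is only an illustration: `hash_index`, `find_slot`, `contains` and `insert` are hypothetical names, and `hash_index` is a simple stand-in for the question's `oat_hash`.

```c
#include <stdlib.h>
#include <string.h>

#define CAPACITY 100000

char *DICTIONARY[CAPACITY];          /* NULL means the slot is empty */
unsigned short CODES[CAPACITY];

/* Hypothetical stand-in for oat_hash, already reduced to a valid index. */
size_t hash_index(const char *s)
{
    size_t h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h % CAPACITY;
}

/* Returns the slot that holds `word`, or the first empty slot where it
   could be stored. Returns CAPACITY if the table is full and the word
   is absent. */
size_t find_slot(const char *word)
{
    size_t start = hash_index(word);
    for (size_t n = 0; n < CAPACITY; n++) {
        size_t i = (start + n) % CAPACITY;
        if (DICTIONARY[i] == NULL || strcmp(DICTIONARY[i], word) == 0)
            return i;
    }
    return CAPACITY;
}

/* True only if this exact word is stored, not merely some word with
   the same hash. */
int contains(const char *word)
{
    size_t i = find_slot(word);
    return i < CAPACITY && DICTIONARY[i] != NULL;
}

/* Stores a copy of `word` with its code. Returns 1 on success (or if
   the word was already present), 0 if the table is full or malloc fails. */
int insert(const char *word, unsigned short code)
{
    size_t i = find_slot(word);
    if (i == CAPACITY)
        return 0;
    if (DICTIONARY[i] == NULL) {
        char *copy = malloc(strlen(word) + 1);
        if (copy == NULL)
            return 0;
        strcpy(copy, word);
        DICTIONARY[i] = copy;
        CODES[i] = code;
    }
    return 1;
}
```

With this shape, the encoder would call `contains(workingWord_and_newSymbol)` in step 4 and read the output code from `CODES[find_slot(working_word)]` in step 7, so two different words can never share one code.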

Answer 2

LZW compression is certainly used to produce binary files, and it should generally be able to read binary files as well.

The following code is a problem because it relies on new_symbol never being \0.

newSymbol[0] = new_symbol; newSymbol[1] = '\0';
strlen(workingWord_and_newSymbol)
strdup(newSymbol)

It needs to be rewritten to work with arrays of bytes rather than strings.
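A minimal sketch of that idea, assuming a fixed-size buffer with an explicit length in place of a NUL-terminated string (the `Word` type and the `word_*` helpers are hypothetical names, not part of the original code):

```c
#include <stddef.h>
#include <string.h>

#define WORD_MAX 2048

/* A working word stored as raw bytes plus a length, so a 0x00 byte
   is ordinary data and not a terminator. */
typedef struct {
    unsigned char data[WORD_MAX];
    size_t len;
} Word;

/* Starts a new word consisting of a single byte. */
void word_reset(Word *w, unsigned char byte)
{
    w->data[0] = byte;
    w->len = 1;
}

/* Appends one byte; returns 0 instead of overflowing the buffer. */
int word_append(Word *w, unsigned char byte)
{
    if (w->len >= WORD_MAX)
        return 0;
    w->data[w->len++] = byte;
    return 1;
}

/* Compares lengths first, then contents with memcmp, since strcmp
   would stop at the first 0x00 byte. */
int word_equal(const Word *a, const Word *b)
{
    return a->len == b->len && memcmp(a->data, b->data, a->len) == 0;
}
```

The hash call would then take `w.data` and `w.len` directly instead of relying on `strlen`.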

fopen() is not shown. Make sure the file is opened in binary mode: input = fopen(..., "rb");

@Wumpus Q. Wumbley is correct: use int newSymbol.
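The reason is that getc() returns an int holding either a byte value in 0..255 or EOF. If the result is stored in a plain char, a 0xFF byte can compare equal to EOF where char is signed, and where char is unsigned the comparison never succeeds. A small sketch of the read loop, with `count_bytes` as a hypothetical helper for illustration:

```c
#include <stdio.h>

/* Reads every byte of `in` and returns how many were read. */
long count_bytes(FILE *in)
{
    int c;              /* int, not char: must hold 0..255 and EOF */
    long n = 0;

    while ((c = getc(in)) != EOF) {
        unsigned char byte = (unsigned char)c;
        (void)byte;     /* a real encoder would process this byte here */
        n++;
    }
    return n;
}
```

Combined with a stream opened in "rb" mode, this loop passes through 0x00 and 0xFF bytes like any other data.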

Minor:

Having both new_symbol and newSymbol is confusing.

Consider:

// char *working_word = malloc(2048*sizeof(char));
#define WORKING_WORD_N (2048)
char *working_word = malloc(WORKING_WORD_N*sizeof(*working_word));
// or 
char *working_word = malloc(WORKING_WORD_N);

LZW encoding for large files
