Question

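I wrote a program in C that reads in a file of words and counts how many words the file contains, as well as how many times each word occurs.

When I run it through Valgrind, I either lose too many bytes or get a segmentation fault.

How can I remove duplicate elements from a dynamically allocated array and free the memory as well?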

int tokenize(Dictionary **dictionary, char *words, int total_words)
{

    char *delim = " .,?!:;/\"\'\n\t";
    char **temp = malloc(sizeof(char) * strlen(words) + 1);
    char *token = strtok(words, delim);

    *dictionary = (Dictionary*)malloc(sizeof(Dictionary) * total_words);

    int count = 1, index = 0;

    while (token != NULL)
    {
        temp[index] = (char*)malloc(sizeof(char) * strlen(token) + 1);
        strcpy(temp[index], token);

        token = strtok(NULL, delim);

        index++;
    }

    for (int i = 0; i < total_words; ++i)
    {
        for (int j = i + 1; j < total_words; ++j)
        {
            if (strcmp(temp[i], temp[j]) == 0) // <------ segmentation fault occurs here
            {
                count++;

                for (int k = j; k < total_words; ++k) // <----- loop to remove duplicates
                    temp[k] = temp[k+1];

                total_words--;
                j--;
            }
        }


        int length = strlen(temp[i]) + 1;
        (*dictionary)[i].word = (char*)malloc(sizeof(char) * length);

        strcpy((*dictionary)[i].word, temp[i]);
        (*dictionary)[i].count = count;

        count = 1;
    }

    free(temp);
    return 0;
}

Thanks in advance.

Answer 1

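Without A Minimal, Complete, and Verifiable example, there is no way to guarantee that additional problems do not originate elsewhere in your code, but the following need close attention: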

    char **temp = malloc(sizeof(char) * strlen(words) + 1);

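Above you are allocating pointers, not words, so your allocation is too small by a factor of sizeof (char*) / sizeof (char). To prevent such problems, use sizeof *thepointer and you will always have the correct size, e.g.

    char **temp = malloc (sizeof *temp * strlen(words) + 1);

(The + 1 is unnecessary unless you intend to provide a sentinel NULL as the final pointer, and as written it adds one byte, not one pointer. You must also validate the return, see below.)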
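To illustrate the sizing rule, here is a minimal sketch that sizes an array of pointers from the pointer itself and reserves one extra slot for a sentinel NULL. The helper name alloc_ptr_array is hypothetical and not part of your code. Note the parentheses: the + 1 has to be added to the element count before multiplying.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: allocate n pointers plus one sentinel NULL slot.
 * sizeof *arr is the size of one element (a char *), whatever type arr has. */
char **alloc_ptr_array (size_t n)
{
    char **arr = malloc (sizeof *arr * (n + 1));

    if (!arr) {                 /* validate every allocation */
        perror ("malloc - arr");
        return NULL;
    }
    for (size_t i = 0; i <= n; i++)
        arr[i] = NULL;          /* the last slot doubles as the sentinel */

    return arr;
}
```

The caller fills slots 0 to n-1 and can later walk the array until it reaches the NULL, without carrying a separate count.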

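Next:

    *dictionary = (Dictionary*)malloc(sizeof(Dictionary) * total_words);

There is no need to cast the return of malloc. See: Do I cast the result of malloc?. Further, if *dictionary was previously allocated elsewhere, the allocation above creates a memory leak, because you lose the reference to the original pointer. If it was previously allocated, you need realloc, not malloc. If it was not, a better way of writing it is: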

    *dictionary = malloc (sizeof **dictionary * total_words);

You must validate that the allocation succeeded before attempting to use the block of memory, e.g.

    if (! *dictionary) {
        perror ("malloc - *dictionary");
        exit (EXIT_FAILURE);
    }

In:

    temp[index] = (char*)malloc(sizeof(char) * strlen(token) + 1);

sizeof(char) is always 1 and can be omitted. Better written:

    temp[index] = malloc (strlen(token) + 1);

or better still, allocate and validate in a single block:

    if (!(temp[index] = malloc (strlen(token) + 1))) {
        perror ("malloc - temp[index]");
        exit (EXIT_FAILURE);
    }

then

    strcpy(temp[index++], token);

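Next, while total_words may equal the number of words in temp, you have only validated that you have index words. Combined with your original allocation using sizeof (char) instead of sizeof (char *), it is no wonder you get a segfault when you attempt to iterate over the list of pointers in temp. Better:

    for (int i = 0; i < index; ++i) {
        for (int j = i + 1; j < index; ++j)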

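(The same applies to your k loop, where temp[k+1] also reads one element past the end on the last iteration. Additionally, since you have allocated each temp[index], when you shuffle the pointers down with temp[k] = temp[k+1]; the first assignment overwrites the address of the duplicate, and every address lost this way is a memory leak. The duplicate should be freed before the assignment is made.)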
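As a minimal sketch of removing duplicates without leaking, assuming every element of the array was allocated with malloc: the duplicate is freed first, the remaining pointers are shifted down without reading past the end, and the index only advances when nothing was removed. The name remove_duplicates is hypothetical and not taken from your code.

```c
#include <stdlib.h>
#include <string.h>

/* Sketch: remove duplicate strings from arr[0..n-1], where every element
 * was allocated with malloc. Returns the new number of elements. */
size_t remove_duplicates (char **arr, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        size_t j = i + 1;
        while (j < n) {
            if (strcmp (arr[i], arr[j]) == 0) {
                free (arr[j]);                      /* release the duplicate first */
                for (size_t k = j; k + 1 < n; k++)  /* k + 1 stays inside the array */
                    arr[k] = arr[k + 1];
                n--;            /* do not advance j: a new element now sits at j */
            }
            else
                j++;
        }
    }
    return n;
}
```

The caller keeps the returned count and uses it both for iterating and for freeing the surviving strings before freeing the array itself.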

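While you update total_words--, there still has never been a validation that index == total_words, and if they are not equal, you can have no confidence in total_words or that, as a result, you will not attempt to iterate over an uninitialized pointer.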

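The rest looks workable, but after making the changes above, you should make sure no additional changes are needed. Look things over and let me know if you need further help. (With an MCVE, I would be happy to help further.)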

Additional Issues

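I apologize for the delay, the real world called, and this took longer than expected because what you have is an awkward slow-motion logic train wreck. First, while there is nothing wrong with reading an entire text file into a buffer with fread, the buffer is not nul-terminated and so cannot be used with any function expecting a string. Yes, strtok, strcpy, or any other string function will read past the end of word_data looking for the nul-terminating character (well off into memory you do not own), resulting in a SegFault.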

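Your various scattered + 1 additions to your malloc allocations now make more sense, as it appears you were looking for where you needed to add an additional character to make sure you could nul-terminate word_data, but could not quite figure out where it went. (Don't worry, I have corrected that for you, but it is a big hint that you may be going about this the wrong way: reading with POSIX getline or fgets is generally a better approach to this type of text processing than file-at-once.)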
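For comparison, a line-oriented version might look like the sketch below. It only counts tokens and is not a drop-in replacement for your program. The helper name count_tokens and the 1024-character buffer are assumptions made for the illustration, and a line longer than the buffer would be read in pieces, which could split a word.

```c
#include <stdio.h>
#include <string.h>

#define LINE_MAX_CHARS 1024     /* assumed upper bound on line length for this sketch */

/* Hypothetical helper: count the tokens in a stream one line at a time.
 * fgets always nul-terminates buf, so strtok can be used on it directly. */
size_t count_tokens (FILE *fp, const char *delim)
{
    char buf[LINE_MAX_CHARS];
    size_t n = 0;

    while (fgets (buf, sizeof buf, fp))
        for (char *tok = strtok (buf, delim); tok; tok = strtok (NULL, delim))
            n++;

    return n;
}
```

Because each line arrives as a proper string, there is no need to track the file size or to nul-terminate anything by hand.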

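Literally, that is just the tip of the iceberg of the problems in your code. As mentioned earlier, in tokenize you fail to validate that index equals total_words. This ends up being important given that your choice of delim includes the ASCII apostrophe (or single quote). It causes index to exceed word_count any time a possessive or contraction is encountered in the buffer (e.g. "can't" is split into "can" and "t", "Peter's" into "Peter" and "s", and so on). You will have to decide how you want to address this; for now I have simply removed the single quote.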
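The effect of the apostrophe is easy to check in isolation. The sketch below uses a hypothetical helper, token_count, that counts the tokens strtok produces for a given delimiter set; it works on a copy because strtok modifies the string it is given.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: number of tokens strtok finds in text for a given
 * delimiter set. Returns -1 if the working copy cannot be allocated. */
int token_count (const char *text, const char *delim)
{
    size_t len = strlen (text);
    char *copy = malloc (len + 1);
    int n = 0;

    if (!copy)
        return -1;
    memcpy (copy, text, len + 1);       /* copy includes the nul terminator */

    for (char *tok = strtok (copy, delim); tok; tok = strtok (NULL, delim))
        n++;

    free (copy);
    return n;
}
```

For the text "Peter's dog can't", your original delim gives 5 tokens, while the same set without the apostrophe gives 3, which is what a whitespace-based word count expects.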

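The logic in tokenize and count_words was difficult to follow and in some respects simply wrong, and the return type of read_file (void) provided absolutely no way of indicating success or failure. Always choose a return type that provides meaningful information from which you can determine whether a critical function succeeded or failed (reading your data qualifies).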

If a function provides a return, use it. That applies to every function that can fail (including functions such as fseek).
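Returning 0 from tokenize misses the chance to return the number of words (allocated structs) in dictionary, which leaves you unable to free the information properly and leaves you guessing at some number to display (e.g. for (int i = 0; i < 333; ++i) in main()). You need to keep track of the number of dictionary structs and word members allocated in tokenize (keep an index, say dindex). Then return dindex to main() (assigned to hello in your code), which provides the information you need to iterate over the structs in main() to output your information, and to free each allocated word before freeing the pointers.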

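If you do not have an accurate count of the dictionary structs allocated back in main(), then you have failed in your two responsibilities regarding any block of memory you allocate: (1) always preserve a pointer to the starting address of the block so that (2) it can be freed when it is no longer needed. If you do not know how many blocks there are, then you have not done (1) and cannot do (2).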
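Once the count is available, the cleanup can be collected in one place. The following is only a sketch under the assumption that each entry has a word pointer and a count, as in your struct; free_dictionary is a hypothetical helper name.

```c
#include <stdlib.h>

typedef struct {
    char *word;
    int count;
} dictionary_t;

/* Hypothetical helper: release n entries and then the array itself.
 * n must be the number of entries whose word was actually allocated. */
void free_dictionary (dictionary_t *dict, int n)
{
    if (!dict)
        return;
    for (int i = 0; i < n; i++)
        free (dict[i].word);    /* free each word first ... */
    free (dict);                /* ... then the block of structs */
}
```

Passing a count that is too large would call free on uninitialized pointers, which is exactly why tokenize has to report how many entries it filled.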

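This next point is a nit about style. While not an error, the standard coding style for C avoids Initialcaps, camelCase, or MixedCase variable names in favor of all lower-case, while reserving upper-case names for macros and constants. It is a matter of style, so it is completely up to you, but failing to follow it can lead to the wrong first impression in some circles.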

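Rather than go on for several more paragraphs, I have reworked your example and added a few inline comments. I have not torture-tested it for all corner cases, but it should be a sound base to build from. You will notice that count_words and tokenize have been simplified. Try to understand why what was done was done, and ask if you have any questions: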

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <errno.h>

typedef struct{
    char *word;
    int count;
} dictionary_t;

char *read_file (FILE *file, char **words, size_t *length)
{
    size_t size = *length = 0;

    if (fseek (file, 0, SEEK_END) == -1) {
        perror ("fseek SEEK_END");
        return NULL;
    }
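    long pos = ftell (file);    /* ftell returns -1L on failure */
    if (pos == -1L) {
        perror ("ftell");
        return NULL;
    }
    size = (size_t)pos;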

    if (fseek (file, 0, SEEK_SET) == -1) {
        perror ("fseek SEEK_SET");
        return NULL;
    }

    /* +1 needed to nul-terminate buffer to pass to strtok */
    if (!(*words = malloc (size + 1))) {
        perror ("malloc - size");
        return NULL;
    }

    if (fread (*words, 1, size, file) != size) {
        perror ("fread words");
        free (*words);
        return NULL;
    }

    *length = size;
    (*words)[*length] = 0;  /* nul-terminate buffer - critical */

    return *words;
}

int tokenize (dictionary_t **dictionary, char *words, int total_words)
{
    // char *delim = " .,?!:;/\"\'\n\t";    /* don't split on apostrophes */
    char *delim = " .,?!:;/\"\n\t";
    char **temp = malloc (sizeof *temp * total_words);
    char *token = strtok(words, delim);
    int index = 0, dindex = 0;

    if (!temp) {
        perror ("malloc temp");
        return -1;
    }

    if (!(*dictionary = malloc (sizeof **dictionary * total_words))) {
        perror ("malloc - dictionary");
        free (temp);    /* don't leak temp on failure */
        return -1;
    }

    /* bound by total_words so temp[] is never written past its end */
    while (token != NULL && index < total_words)
    {
        if (!(temp[index] = malloc (strlen (token) + 1))) {
            perror ("malloc - temp[index]");
            exit (EXIT_FAILURE);
        }
        strcpy(temp[index++], token);

        token = strtok (NULL, delim);
    }

    /* validate total_words = index and that no tokens were left unread */
    if (token != NULL || total_words != index) {
        fprintf (stderr, "error: token count does not match total_words "
                "(%d expected, %d stored)\n", total_words, index);
        /* handle error */
    }


    for (int i = 0; i < index; i++) {   /* only index pointers were filled */
        int found = 0, j = 0;
        for (; j < dindex; j++)
            if (strcmp((*dictionary)[j].word, temp[i]) == 0) {
                found = 1;
                break;
            }
        if (!found) {
            if (!((*dictionary)[dindex].word = malloc (strlen (temp[i]) + 1))) {
                perror ("malloc (*dictionary)[dindex].word");
                exit (EXIT_FAILURE);
            }
            strcpy ((*dictionary)[dindex].word, temp[i]);
            (*dictionary)[dindex++].count = 1;
        }
        else
            (*dictionary)[j].count++;
    }

    for (int i = 0; i < index; i++)
        free (temp[i]);     /* you must free storage for words */
    free (temp);            /* before freeing pointers */

    return dindex;
}

int count_words (char *words, size_t length)
{
    int count = 0;
    char previous_char = ' ';

    while (length--) {
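        /* cast to unsigned char: passing a negative char to isspace is undefined */
        if (isspace ((unsigned char)previous_char) && !isspace ((unsigned char)*words))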
            count++;
        previous_char = *words++;
    }

    return count;
}

int main (int argc, char **argv)
{
    char *word_data = NULL;
    int word_count, hello;
    size_t length = 0;
    dictionary_t *dictionary = NULL;
    FILE *input = argc > 1 ? fopen (argv[1], "r") : stdin;

    if (!input) {   /* validate file open for reading */
        fprintf (stderr, "error: file open failed '%s'.\n", argv[1]);
        return 1;
    }

    if (!read_file (input, &word_data, &length)) {
        fprintf (stderr, "error: file_read failed.\n");
        return 1;
    }
    if (input != stdin) fclose (input); /* close file if not stdin */

    word_count = count_words (word_data, length);
    printf ("wordct: %d\n", word_count);

    /* number of dictionary words returned in hello */
    if ((hello = tokenize (&dictionary, word_data, word_count)) <= 0) {
        fprintf (stderr, "error: no words or tokenize failed.\n");
        return 1;
    }

    for (int i = 0; i < hello; ++i) {
        printf("%-16s : %d\n", dictionary[i].word, dictionary[i].count);
        free (dictionary[i].word);  /* you must free word storage */
    }
    free (dictionary);  /* free pointers */

    free (word_data);   /* free buffer */

    return 0;
}

If you have any further questions, please let me know.

Answer 2

To get your code working, you need to do a few things:

Fix the memory allocation of temp by replacing sizeof(char) with sizeof(char *), like so:

    char **temp = malloc(sizeof(char *) * strlen(words) + 1);
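Leave sizeof(Dictionary) in the allocation of dictionary as it is, since the array holds Dictionary structs rather than pointers to them, and only dereference total_words:

    *dictionary = (Dictionary*)malloc(sizeof(Dictionary) * (*total_words));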
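Pass the address of word_count when calling tokenize:

    int hello = tokenize(&dictionary, word_data, &word_count);

Replace all occurrences of total_words in the tokenize function with (*total_words). In the tokenize function signature, replace int total_words with int *total_words.
You should also replace the hard-coded value 333 in the for loop in the main function with word_count.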

After making these changes, your code should work as expected. I was able to run it successfully with these changes.
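The reason for passing the address is that C passes arguments by value, so a function that shrinks a count can only report the new value through its return or through a pointer. A minimal sketch of the pointer variant on a plain int array (unique_ints is a hypothetical name chosen for the illustration):

```c
#include <stddef.h>

/* Hypothetical helper: remove repeated values from arr in place.
 * count is an in/out parameter: it holds the length on entry and the
 * number of unique values on return, so the caller sees the change. */
void unique_ints (int *arr, size_t *count)
{
    size_t kept = 0;

    for (size_t i = 0; i < *count; i++) {
        size_t j = 0;
        while (j < kept && arr[j] != arr[i])    /* already kept? */
            j++;
        if (j == kept)
            arr[kept++] = arr[i];               /* first occurrence: keep it */
    }
    *count = kept;                              /* write the new length back */
}
```

After the call, the caller's own variable holds the number of unique values and can be used directly as the loop bound, which is the role word_count plays in main.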

How to remove duplicate elements from a dynamically allocated array of strings in C
