Question

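I'm writing a program that essentially searches a directory and all of its subdirectories for duplicate files. I've improved the question and the code based on your suggestions (the functions that needed a default return value have been fixed), so here it is...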

Here is the code for the comparison functions:

int compare()
{
    int a, b;
    unsigned char byte1, byte2;

    while(1)
    {
        a = fread(&byte1, 1, 1, file1);
        b = fread(&byte2, 1, 1, file2);
        if(a == 0 && b == 0) break;
        if(a != b) return 1;
        if(byte2 != byte1) return 1;
    }

    return 0;
}

void startCompare()
{
    char path1[1000], path2[1000];
    FILE *reference = fopen("list.comp", "r");
    FILE *other = fopen("list2.comp", "r");
    int i, flag, j;
    i = 0;

    while(fgets(path1, 1000, reference))
    {
        flag = 0;
        strtok(path1, "\n");  
        openFile1(path1);
        for(j = 0; j <= i; ++j)
        {
            fgets(path2, 1000, other);
        }
        while(fgets(path2, 1000, other))
        {
            strtok(path2, "\n");
            openFile2(path2);
            if(!compare())
            {
                printf("Checking: %s vs. %s --> DUPLICATE\n", path1, path2);
                flag = 1;
                break;
            }
            else
            {
                printf("Checking: %s vs. %s --> DIFFERENT\n", path1, path2);
            }
        }
        if(flag == 1)
        {
            printf("Will be deleted.\n");
        }
    }
}

(The startCompare() function is called first.)

Now, the directory itself contains the following files:

bloblo
bloblo/frofo
bloblo/frofo/New folder
bloblo/frofo/New folder (2)
bloblo/frofo/New folder (2)/New folder (3)
bloblo/frofo/New folder (2)/New folder (3)/New Text Document.txt
bloblo/frofo/New folder (2)/New folder (3)/Untitled4
0.comp
1.comp
2.comp
3.comp
4.comp
5.comp
11.comp
100.comp
duplicate_delete.dev
duplicate_delete.exe
duplicate_delete.layout
list.comp
list2.comp
main.c
main.o
Makefile.win
Untitled5.c
Untitled5.exe

The output is:

Checking: 0.comp vs. 1.comp --> DIFFERENT
Checking: 0.comp vs. 100.comp --> DIFFERENT
Checking: 0.comp vs. 11.comp --> DIFFERENT
Checking: 0.comp vs. 2.comp --> DIFFERENT
Checking: 0.comp vs. 3.comp --> DIFFERENT
Checking: 0.comp vs. 4.comp --> DIFFERENT
Checking: 0.comp vs. 5.comp --> DIFFERENT
Checking: 0.comp vs. duplicate_delete.dev --> DIFFERENT
Checking: 0.comp vs. duplicate_delete.exe --> DIFFERENT
Checking: 0.comp vs. duplicate_delete.layout --> DIFFERENT
Checking: 0.comp vs. list.comp --> DIFFERENT
Checking: 0.comp vs. list2.comp --> DIFFERENT
Checking: 0.comp vs. main.c --> DIFFERENT
Checking: 0.comp vs. main.o --> DIFFERENT
Checking: 0.comp vs. Makefile.win --> DIFFERENT
Checking: 0.comp vs. Untitled5.c --> DIFFERENT
Checking: 0.comp vs. Untitled5.exe --> DIFFERENT

The return code is 0.

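What it should print instead is every file being checked against every other file, finding that 100.comp and 11.comp are copies of each other and that the other files are unique. Basically, why does it stop there? Why doesn't it keep checking? Is there any way to fix this?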

Answer 1

I don't know whether this will answer your TL;DR question and code, but it is too much for a comment.

Your function compare() never returns 0, which you would know if you had enabled and heeded compiler warnings. The function also uses the dreaded feof(). See "why feof() is wrong".

I suggest replacing this

int compare()
{
    while(!feof(file1))
    {
        fread(&byte1, sizeof(unsigned char), 1, file1);
        fread(&byte2, sizeof(unsigned char), 1, file2);
        if(byte2 != byte1) return 1;
    }
    if(feof(file1) && (!feof(file2))) return 1;
    if(feof(file2) && (!feof(file1))) return 1;
}

with this, because checking the amount of data read by fread() is the way to test for end of file.

int compare()
// return 0 if files are the same
// *** always include a comment to tell you what the function does / returns ***
{
    size_t read1, read2;
    while(1) {
        read1 = fread(&byte1, 1, 1, file1);
        read2 = fread(&byte2, 1, 1, file2);
        if (read1==0 && read2==0)
            break;             // success: both files ended
        if (read1 != read2)
            return 1;          // bad: one of them read, other didn't
        if (byte2 != byte1)
            return 1;          // bad: files read different data
    }
    return 0;
}

And note that sizeof(unsigned char) is quite unnecessary: it is 1 by definition.

I would also make byte1 and byte2 local variables.
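As a rough sketch of that last point, here is one way the function could look with everything kept local. The name compare_streams and the idea of passing the two streams as parameters are my own assumptions, not something from your code, and I have swapped the byte-at-a-time reads for a 4096-byte buffer, which is a variation rather than a requirement. It assumes regular files, where fread() only returns a short count at end of file or on error.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical variant of compare(): returns 0 if both streams hold
 * identical bytes from their current positions, 1 otherwise.
 * The streams are parameters and the buffers are local, so no globals
 * are needed. */
int compare_streams(FILE *f1, FILE *f2)
{
    unsigned char buf1[4096], buf2[4096];
    size_t n1, n2;

    do {
        n1 = fread(buf1, 1, sizeof buf1, f1);
        n2 = fread(buf2, 1, sizeof buf2, f2);
        if (n1 != n2)
            return 1;              /* one file ended before the other */
        if (memcmp(buf1, buf2, n1) != 0)
            return 1;              /* same length so far, different data */
    } while (n1 > 0);              /* n1 == n2 == 0: both files ended */

    return 0;
}
```

The caller would open both files in binary mode ("rb") and close them after each comparison.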

Answer 2

To answer the question you actually asked (why it doesn't compare every pair of files):

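Your program first reads the first line from list.comp. Then it reads every line from list2.comp and compares the files.

Then it reads the next line from list.comp and tries to compare it against all the files in list2.comp again. But it is already at the end of list2.comp, so it doesn't read any more file names from list2.comp.

You can use rewind(other); to "rewind" back to the beginning of list2.comp, so that you can read the file names again.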
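A minimal sketch of the loop structure with rewind() in place might look like this. The function name for_each_pair is made up for illustration, and it only prints the pairs instead of opening and comparing the files. I am also assuming the intent was to skip the entries already handled, so the sketch increments i after each outer iteration, which the posted code never does.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: walks two copies of the same list and visits
 * every pair (line i of reference, line j of other) with i < j.
 * Returns the number of pairs visited. */
long for_each_pair(FILE *reference, FILE *other)
{
    char path1[1000], path2[1000];
    long i = 0, j, pairs = 0;
    int skipped;

    while (fgets(path1, sizeof path1, reference)) {
        path1[strcspn(path1, "\n")] = '\0';   /* strip the newline */

        rewind(other);                        /* start list2 from the top */
        skipped = 1;
        for (j = 0; j <= i; ++j) {            /* skip entries 0..i */
            if (!fgets(path2, sizeof path2, other)) {
                skipped = 0;
                break;
            }
        }

        while (skipped && fgets(path2, sizeof path2, other)) {
            path2[strcspn(path2, "\n")] = '\0';
            printf("Pair: %s vs. %s\n", path1, path2);
            ++pairs;                          /* compare the two files here */
        }
        ++i;                                  /* assumption: advance the skip count */
    }
    return pairs;
}
```

With n entries in each list this visits n(n-1)/2 pairs, so a four-line list produces six comparisons.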

Answer 3

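I'm sorry to say it, but if you want to search for duplicate files, you are using an inefficient approach. It is better to run md5sum(1) on them and sort the resulting list. Then, to find equal md5 values, you only have to compare each line with the next one. If you don't trust md5sum(1) (some say different files can produce the same checksum), you can still compare the file contents, but only for files whose md5 checksums already match. This is by far more efficient than your approach. It can be done with: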

find <dir> -type f -name "glob_pattern" -print0 | xargs -0 md5sum | sort >files.md5sum

Then edit the file files.md5sum and search for the pattern /^\([0-9a-f]*\) .*\n\1/ to find the duplicated md5 values. A single file may even have several duplicates.
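If you would rather not search by hand, the adjacent-line comparison can be sketched in C as well. This is only an illustration: the function name report_duplicates is hypothetical, and it assumes each line has the usual md5sum shape (hash, one space, a space or '*', then the path) with no escaped file names. A match here only means the checksums agree, so a byte comparison can still be run on the reported pairs.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: reads lines of the form "<hash>  <path>" that are
 * already sorted by hash, prints every path whose hash equals that of the
 * first file in its group, and returns the number of such paths. */
int report_duplicates(FILE *sorted)
{
    char line[4096], prev_path[4096] = "", prev_hash[129] = "";
    int dups = 0;

    while (fgets(line, sizeof line, sorted)) {
        size_t hash_len;
        const char *path;

        line[strcspn(line, "\n")] = '\0';
        hash_len = strcspn(line, " ");
        /* skip lines that do not look like "<hash>  <path>" */
        if (hash_len == 0 || hash_len >= sizeof prev_hash ||
            line[hash_len] == '\0' || line[hash_len + 1] == '\0')
            continue;
        path = line + hash_len + 2;

        if (strlen(prev_hash) == hash_len &&
            strncmp(line, prev_hash, hash_len) == 0) {
            printf("%s has the same MD5 as %s\n", path, prev_path);
            ++dups;
        } else {
            /* new hash group: remember its first file */
            memcpy(prev_hash, line, hash_len);
            prev_hash[hash_len] = '\0';
            strcpy(prev_path, path);
        }
    }
    return dups;
}
```

It would be called with files.md5sum opened for reading, for example report_duplicates(fopen("files.md5sum", "r")) after checking that fopen() succeeded.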

Note

Be aware that empty files all have the same MD5 checksum, and that they are all equal to one another. You will get these in your list too.

fgets() does not read the whole file
