Fast Linux file count

Question

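I'm trying to figure out the best way to find the number of files in a particular directory when there are a very large number of files (> 100,000).

When there are that many files, running "ls | wc -l" takes quite a long time. I believe this is because it returns the names of all the files. I'm trying to use as little disk I/O as possible.

I have experimented with some shell and Perl scripts to no avail. Any ideas?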

Answer 1

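By default ls sorts the names, which can take a while if there are a lot of them. It also produces no output until all of the names have been read and sorted. Use the ls -f option to turn off sorting.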

ls -f | wc -l

Note that this also enables -a, so ., .., and other files starting with . will be counted.
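If those extra entries matter, the same unsorted read can be done in a few lines of C while skipping them. This is only a minimal sketch: count_entries and its include_hidden parameter are hypothetical names chosen for illustration, not part of any existing tool.

```c
#include <assert.h>
#include <dirent.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: count the entries of `path` in directory order
 * (no sorting). "." and ".." are always skipped; other dot-files are
 * counted only when include_hidden is non-zero.
 * Returns -1 if the directory cannot be opened. */
long count_entries(const char *path, int include_hidden) {
    assert(path != NULL);

    DIR *dir = opendir(path);
    if (dir == NULL)
        return -1;

    long count = 0;
    struct dirent *ent;
    while ((ent = readdir(dir)) != NULL) {
        if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
            continue;                      /* the two entries ls -f adds */
        if (!include_hidden && ent->d_name[0] == '.')
            continue;                      /* what plain ls would hide */
        ++count;
    }

    closedir(dir);
    return count;
}
```

Calling count_entries(dir, 1) should give the "ls -f | wc -l" figure minus two, and count_entries(dir, 0) the plain "ls | wc -l" figure.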

Answer 2

The fastest way is a purpose-built program, like this:

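#include <stdio.h>
#include <dirent.h>

int main(int argc, char *argv[]) {
    DIR *dir;
    struct dirent *ent;
    long count = 0;

    /* Require exactly one directory argument */
    if (argc != 2) {
        fprintf(stderr, "usage: %s directory\n", argv[0]);
        return 1;
    }

    dir = opendir(argv[1]);
    if (dir == NULL) {
        perror(argv[1]);
        return 1;
    }

    /* Counts every entry readdir() returns, including "." and ".." */
    while((ent = readdir(dir)))
            ++count;

    closedir(dir);

    printf("%s contains %ld files\n", argv[1], count);

    return 0;
}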

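In my testing, without regard to cache, I ran each of these about 50 times against the same directory, over and over, to avoid cache-based data skew, and I got roughly the following performance numbers (in real clock time):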

ls -1  | wc - 0:01.67
ls -f1 | wc - 0:00.14
find   | wc - 0:00.22
dircnt | wc - 0:00.04

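That last one, dircnt, is the program compiled from the above source.

EDIT 2016-09-26

Due to popular demand, I've rewritten this program to be recursive, so it will drop into subdirectories and continue to count files and directories separately.

Since it's clear some folks want to know how to do all this, I have put a lot of comments in the code to try to make it obvious what's going on. I wrote this and tested it on 64-bit Linux, but it should work on any POSIX-compliant system, including Microsoft Windows. Bug reports are welcome; I'm happy to update this if you can't get it working on your AIX or OS/400 or whatever.

As you can see, it's much more complicated than the original, and necessarily so: at least one function must exist to be called recursively unless you want the code to become very complex (e.g. managing a subdirectory stack and processing that in a single loop). Since we have to check file types, differences between operating systems, standard libraries, etc. come into play, so I have written a program that tries to be usable on any system where it will compile.

There is very little error checking, and the count function itself doesn't really report errors. The only calls that can really fail are opendir and stat (if you aren't lucky and don't have a system where dirent already contains the file type). I'm not paranoid about checking the total length of the subdirectory path names, but theoretically the system shouldn't allow any path name that is longer than PATH_MAX. If there are concerns, I can fix that, but it's just more code that needs to be explained to someone learning to write C. This program is intended to be an example of how to dive into subdirectories recursively.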

#include <stdio.h>
#include <dirent.h>
#include <string.h>
#include <stdlib.h>
#include <limits.h>
#include <sys/stat.h>

#if defined(WIN32) || defined(_WIN32) 
#define PATH_SEPARATOR '\\' 
#else
#define PATH_SEPARATOR '/' 
#endif

/* A custom structure to hold separate file and directory counts */
struct filecount {
  long dirs;
  long files;
};

/*
 * counts the number of files and directories in the specified directory.
 *
 * path - relative pathname of a directory whose files should be counted
 * counts - pointer to struct containing file/dir counts
 */
void count(char *path, struct filecount *counts) {
    DIR *dir;                /* dir structure we are reading */
    struct dirent *ent;      /* directory entry currently being processed */
    char subpath[PATH_MAX];  /* buffer for building complete subdir and file names */
    /* Some systems don't have dirent.d_type field; we'll have to use stat() instead */
#if !defined ( _DIRENT_HAVE_D_TYPE )
    struct stat statbuf;     /* buffer for stat() info */
#endif

/* fprintf(stderr, "Opening dir %s\n", path); */
    dir = opendir(path);

    /* opendir failed... file likely doesn't exist or isn't a directory */
    if(NULL == dir) {
        perror(path);
        return;
    }

    while((ent = readdir(dir))) {
      /* +1 for the separator; >= leaves room for the terminating NUL */
      if (strlen(path) + 1 + strlen(ent->d_name) >= PATH_MAX) {
          fprintf(stderr, "path too long (%zu) %s%c%s\n", strlen(path) + 1 + strlen(ent->d_name), path, PATH_SEPARATOR, ent->d_name);
          closedir(dir);
          return;
      }

/* Use dirent.d_type if present, otherwise use stat() */
#if defined ( _DIRENT_HAVE_D_TYPE )
/* fprintf(stderr, "Using dirent.d_type\n"); */
      if(DT_DIR == ent->d_type) {
#else
/* fprintf(stderr, "Don't have dirent.d_type, falling back to using stat()\n"); */
      sprintf(subpath, "%s%c%s", path, PATH_SEPARATOR, ent->d_name);
      if(lstat(subpath, &statbuf)) {
          perror(subpath);
          closedir(dir);   /* don't leak the open directory on early return */
          return;
      }

      if(S_ISDIR(statbuf.st_mode)) {
#endif
          /* Skip "." and ".." directory entries... they are not "real" directories */
          if(0 == strcmp("..", ent->d_name) || 0 == strcmp(".", ent->d_name)) {
/*              fprintf(stderr, "This is %s, skipping\n", ent->d_name); */
          } else {
              sprintf(subpath, "%s%c%s", path, PATH_SEPARATOR, ent->d_name);
              counts->dirs++;
              count(subpath, counts);
          }
      } else {
          counts->files++;
      }
    }

/* fprintf(stderr, "Closing dir %s\n", path); */
    closedir(dir);
}

int main(int argc, char *argv[]) {
    struct filecount counts;
    counts.files = 0;
    counts.dirs = 0;

    /* Require exactly one directory argument */
    if(2 != argc) {
        fprintf(stderr, "usage: %s directory\n", argv[0]);
        return 1;
    }

    count(argv[1], &counts);

    /* If we found nothing, this is probably an error which has already been printed */
    if(0 < counts.files || 0 < counts.dirs) {
        printf("%s contains %ld files and %ld directories\n", argv[1], counts.files, counts.dirs);
    }

    return 0;
}
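For anyone curious about the alternative mentioned above (managing a stack of subdirectories and processing it in a single loop), here is a rough sketch of what that could look like. It is not the code from this answer: struct tally and count_iterative are hypothetical names, the separator is assumed to be '/', and every entry is classified with lstat() rather than d_type to keep the sketch short.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

struct tally { long dirs; long files; };   /* hypothetical name */

/* Count files and directories below `root` without recursion: directories
 * still to be scanned are kept on a heap-allocated stack of path strings.
 * Returns 0 on success, -1 if memory runs out. */
int count_iterative(const char *root, struct tally *out) {
    assert(root != NULL && out != NULL);

    size_t cap = 16, top = 0;
    char **stack = malloc(cap * sizeof *stack);
    if (stack == NULL)
        return -1;
    stack[top] = strdup(root);
    if (stack[top] == NULL) { free(stack); return -1; }
    top++;

    out->dirs = out->files = 0;
    int rc = 0;

    while (top > 0 && rc == 0) {
        char *path = stack[--top];          /* pop the next directory */
        DIR *dir = opendir(path);
        if (dir == NULL) { perror(path); free(path); continue; }

        struct dirent *ent;
        while ((ent = readdir(dir)) != NULL) {
            if (!strcmp(ent->d_name, ".") || !strcmp(ent->d_name, ".."))
                continue;

            /* Build "path/name" in a buffer of exactly the right size */
            size_t len = strlen(path) + 1 + strlen(ent->d_name) + 1;
            char *sub = malloc(len);
            if (sub == NULL) { rc = -1; break; }
            snprintf(sub, len, "%s/%s", path, ent->d_name);

            struct stat st;
            if (lstat(sub, &st) != 0) { perror(sub); free(sub); continue; }

            if (S_ISDIR(st.st_mode)) {
                out->dirs++;
                if (top == cap) {           /* grow the stack when full */
                    char **grown = realloc(stack, 2 * cap * sizeof *stack);
                    if (grown == NULL) { free(sub); rc = -1; break; }
                    stack = grown;
                    cap *= 2;
                }
                stack[top++] = sub;         /* push; freed after its scan */
            } else {
                out->files++;
                free(sub);
            }
        }
        closedir(dir);
        free(path);
    }

    while (top > 0)                         /* release anything left over */
        free(stack[--top]);
    free(stack);
    return rc;
}
```

Because paths are allocated to fit, this variant has no PATH_MAX buffer to overflow, and only one directory is open at any time. The price is the extra bookkeeping the answer warns about.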

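EDIT 2017-01-17

I've incorporated two changes suggested by @FlyingCodeMonkey:

Use lstat instead of stat. This changes the behavior of the program if the directory you are scanning contains symlinked directories. Previously, a (linked) subdirectory had its file count added to the overall count; now a linked directory counts as a single file, and its contents are not counted.
If the path of a file is too long, an error message is emitted and the scan of that directory stops.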
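The difference between the two calls is easy to see in isolation. A small sketch, where is_directory and its follow_links parameter are hypothetical names used only for this illustration:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <sys/stat.h>

/* Hypothetical helper: returns 1 if `path` is a directory, 0 if it is
 * something else, -1 on error. With follow_links != 0 it uses stat(),
 * which looks through symlinks; otherwise lstat(), which reports on the
 * link itself. */
int is_directory(const char *path, int follow_links) {
    assert(path != NULL);

    struct stat st;
    int rc = follow_links ? stat(path, &st) : lstat(path, &st);
    if (rc != 0)
        return -1;
    return S_ISDIR(st.st_mode) ? 1 : 0;
}
```

For a symlink that points at a directory, is_directory(link, 1) reports a directory while is_directory(link, 0) does not, which is why the lstat version counts the link as one file and does not descend into it.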

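EDIT 2017-06-29

With any luck, this will be the last edit of this answer :)

I've copied this code into a GitHub repository to make it a bit easier to get the code (instead of copy/paste, you can just download the source), plus it makes it easier for anyone to suggest a modification by submitting a pull request on GitHub.

The source is available under Apache License 2.0. Patches* welcome!

* "Patch" is what old people like me call a "pull request".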

Answer 3

Did you try find? For example:

find . -name "*.ext" | wc -l

Answer 4

find, ls, and perl tested against 40,000 files: same speed (though I didn't try to clear the cache):

[user@server logs]$ time find . | wc -l
42917

real    0m0.054s
user    0m0.018s
sys     0m0.040s
[user@server logs]$ time /bin/ls -f | wc -l
42918

real    0m0.059s
user    0m0.027s
sys     0m0.037s

And with Perl's opendir/readdir, the same time:

[user@server logs]$ time perl -e 'opendir D, "."; @files = readdir D; closedir D; print scalar(@files)."\n"'
42918

real    0m0.057s
user    0m0.024s
sys     0m0.033s

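Note: I used /bin/ls -f to make sure any alias is bypassed, since an alias might slow things down a little, and -f to avoid sorting the file names. ls without -f is about twice as slow as find/perl; with -f it seems to take the same time: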

[user@server logs]$ time /bin/ls . | wc -l
42916

real    0m0.109s
user    0m0.070s
sys     0m0.044s

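I would also like to have some script that asks the file system directly, without all the unnecessary information.

The tests were based on the answers of Peter van der Heijden, glenn jackman, and mark4o.

Thomas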

Answer 5

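You can change the output to suit your requirements, but here is a bash one-liner I wrote to recursively count and report the number of files in a series of numerically named directories.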

dir=/tmp/count_these/ ; for i in $(ls -1 ${dir} | sort -n) ; { echo "$i => $(find ${dir}${i} -type f | wc -l),"; }

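This looks recursively for all files (not directories) in the given directory and returns the results in a hash-like format. Simple tweaks to the find command could make the kind of files you want to count more specific, etc.

The result looks something like this: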

1 => 38,
65 => 95052,
66 => 12823,
67 => 10572,
69 => 67275,
70 => 8105,
71 => 42052,
72 => 1184,
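The per-directory "find ... -type f | wc -l" step can also be done in C with the POSIX nftw() tree walker. This is a sketch under assumptions, not a drop-in replacement for the one-liner: count_regular_files and visit are hypothetical names, and since nftw() stats every entry it is not necessarily faster than find.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <ftw.h>
#include <stdlib.h>
#include <sys/stat.h>

/* nftw() has no user-data argument, so the running total lives at file
 * scope. This makes the helper non-reentrant. */
static long regular_files;

/* Callback invoked by nftw() once per object in the tree */
static int visit(const char *path, const struct stat *sb,
                 int typeflag, struct FTW *ftwbuf) {
    (void)path;
    (void)ftwbuf;
    /* FTW_F covers non-directory objects; S_ISREG narrows that to
     * regular files, matching find's -type f */
    if (typeflag == FTW_F && S_ISREG(sb->st_mode))
        regular_files++;
    return 0;                     /* 0 tells nftw() to keep walking */
}

/* Hypothetical helper: recursively count regular files below `root`.
 * FTW_PHYS means symlinks are not followed. Returns -1 on failure. */
long count_regular_files(const char *root) {
    assert(root != NULL);

    regular_files = 0;
    if (nftw(root, visit, 16, FTW_PHYS) != 0)
        return -1;
    return regular_files;
}
```

Calling count_regular_files() once per numbered subdirectory would give the same kind of "name => count" report as above.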

Answer 6

Surprisingly for me, a bare-bones find is very much comparable to ls -f

> time ls -f my_dir | wc -l
17626

real    0m0.015s
user    0m0.011s
sys     0m0.009s

versus

> time find my_dir -maxdepth 1 | wc -l
17625

real    0m0.014s
user    0m0.008s
sys     0m0.010s

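Of course, the values in the third decimal place shift around a bit every time you run either of these, so they are basically identical. Note, however, that find returns one extra unit, because it counts the directory itself (and, as mentioned before, ls -f returns two extra units, since it also counts . and ..). Both outputs above therefore correspond to the same 17624 entries: 17626 - 2 = 17625 - 1.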

Answer 7

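Just adding this for the sake of completeness. The correct answer has of course already been posted by someone else, but you can also get a count of files and directories with the tree program.

Run the command tree | tail -n 1 to get the last line, which will say something like "763 directories, 9290 files". This counts files and folders recursively, excluding hidden files, which can be added with the flag -a. For reference, it took tree 4.8 seconds on my computer to count my whole home directory, which was 24777 directories and 238680 files. find -type f | wc -l took 5.3 seconds, half a second longer, so I think tree is pretty competitive speed-wise.

As long as you don't have any subfolders, tree is a quick and easy way to count the files.

Also, purely for the fun of it, you can use tree | grep '^├' to show only the files/folders in the current directory; this is basically a much slower version of ls.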

Answer 8

You could check whether using opendir() and readdir() in Perl is faster. For an example of those functions, look here

Answer 9

Fast Linux file count

The fastest Linux file count I know of is

locate -c -r '/home'

There is no need to invoke grep! But as mentioned, you need a fresh database (updated daily by a cron job, or manually with sudo updatedb).

From man locate

-c, --count
    Instead  of  writing  file  names on standard output, write the number of matching
    entries only.

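Additionally, you should know that it also counts directories as files!

BTW: If you want an overview of the files and directories on your system, type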

locate -S

It outputs the number of directories, files, etc.

Answer 10

For very large, very deeply nested directories, the answer here is faster than almost everything else on this page:

https://serverfault.com/a/691372/84703

locate -r '.' | grep -c "^$PWD"

Answer 11

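I was trying to count the files in a data set of ~10K folders with ~10K files each. The problem with many of the approaches is that they implicitly stat 100M files, which takes ages.

I took the liberty of extending the approach by christopher-schultz so that it supports passing directories via args (his recursive approach uses stat as well).

Put the following into the file dircnt_args.c: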

#include <stdio.h>
#include <dirent.h>

int main(int argc, char *argv[]) {
    DIR *dir;
    struct dirent *ent;
    long count;
    long countsum = 0;
    int i;

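    for(i=1; i < argc; i++) {
        dir = opendir(argv[i]);
        /* Skip arguments that cannot be opened as directories */
        if(dir == NULL) {
            perror(argv[i]);
            continue;
        }
        count = 0;
        while((ent = readdir(dir)))
            ++count;

        closedir(dir);

        printf("%s contains %ld files\n", argv[i], count);
        countsum += count;
    }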
    printf("sum: %ld\n", countsum);

    return 0;
}

After gcc -o dircnt_args dircnt_args.c you can invoke it like this:

dircnt_args /your/dirs/*

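On 100M files in 10K folders, the above completes quite quickly (~5 minutes for the first run; follow-up runs from cache: ~23 seconds).

The only other approach that finished in less than an hour was ls, with about 1 minute from cache: ls -f /your/dirs/* | wc -l. The count is off by a couple of newlines per directory, though...

Contrary to what I expected, none of my other attempts returned within an hour :-/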

Answer 12

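I'm writing this here because I don't have enough reputation points to comment on an answer, yet I'm allowed to leave my own answer, which makes no sense. Anyway...

Regarding the answer by Christopher Schultz, I suggest changing stat to lstat and possibly adding a bounds check to avoid a buffer overflow: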

/* PATH_SEPARATOR is a single char, so it adds 1; >= leaves room for the terminating NUL */
if (strlen(path) + 1 + strlen(ent->d_name) >= PATH_MAX) {
    fprintf(stderr, "path too long (%zu) %s%c%s\n", strlen(path) + 1 + strlen(ent->d_name), path, PATH_SEPARATOR, ent->d_name);
    closedir(dir);
    return;
}

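The suggestion to use lstat is to avoid following symlinks, which could lead to cycles if a directory contains a symlink to a parent directory.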
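An alternative to a separate length check is to let snprintf() report the overflow while it builds the path. A minimal sketch, assuming a '/' separator; join_path is a hypothetical helper name, not something from the original program:

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write "dir/name" into buf, which holds `size`
 * bytes. Returns 0 on success, or -1 if the result (including the
 * terminating NUL) would not fit. buf is never overrun either way. */
int join_path(char *buf, size_t size, const char *dir, const char *name) {
    assert(buf != NULL && dir != NULL && name != NULL);

    /* snprintf returns the length it wanted to write, excluding the NUL */
    int n = snprintf(buf, size, "%s/%s", dir, name);
    if (n < 0 || (size_t)n >= size)
        return -1;
    return 0;
}
```

In the recursive program this could replace both the strlen() test and the sprintf() call, for example join_path(subpath, sizeof subpath, path, ent->d_name) with the PATH_MAX-sized subpath buffer.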

Answer 13

ls spends most of its time sorting the file names; disabling the sort with -f saves some time:

ls -f | wc -l

Or you can use find:

find . -type f | wc -l

Answer 14

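The fastest way on Linux (the question is tagged linux) is to use a direct system call. Here is a little program that counts the files (only files, no dirs) in a directory. You can count millions of files, and it is around 2.5 times faster than "ls -f" and around 1.3-1.5 times faster than Christopher Schultz's answer.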

#define _GNU_SOURCE
#include <dirent.h>
#include <stdio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>       /* syscall(), close() */
#include <sys/syscall.h>

#define BUF_SIZE 4096

struct linux_dirent {
    long d_ino;
    off_t d_off;
    unsigned short d_reclen;
    char d_name[];
};

int countDir(char *dir) {


    int fd, nread, bpos, numFiles = 0;
    char d_type, buf[BUF_SIZE];
    struct linux_dirent *dirEntry;

    fd = open(dir, O_RDONLY | O_DIRECTORY);
    if (fd == -1) {
        puts("open directory error");
        exit(3);
    }
    while (1) {
        nread = syscall(SYS_getdents, fd, buf, BUF_SIZE);
        if (nread == -1) {
            puts("getdents error");
            exit(1);
        }
        if (nread == 0) {
            break;
        }

        for (bpos = 0; bpos < nread;) {
            dirEntry = (struct linux_dirent *) (buf + bpos);
            d_type = *(buf + bpos + dirEntry->d_reclen - 1);
            if (d_type == DT_REG) {
                // Increase counter
                numFiles++;
            }
            bpos += dirEntry->d_reclen;
        }
    }
    close(fd);

    return numFiles;
}

int main(int argc, char **argv) {

    if (argc != 2) {
        puts("Pass directory as parameter");
        return 2;
    }
    printf("Number of files in %s: %d\n", argv[1], countDir(argv[1]));
    return 0;
}

PS: It is not recursive, but you could modify it to achieve that.
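One caveat when relying on d_type like this: the readdir(3) man page notes that only some filesystems fill it in, and that applications must handle DT_UNKNOWN. Here is a sketch of a fallback, written with plain readdir() rather than the raw system call to keep it short; count_regular is a hypothetical name.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <dirent.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>

/* Hypothetical helper: count regular files directly inside `path`.
 * d_type is trusted when the filesystem provides it; for DT_UNKNOWN
 * entries we fall back to one fstatat() call relative to the open
 * directory. Returns -1 if the directory cannot be opened. */
long count_regular(const char *path) {
    assert(path != NULL);

    DIR *dir = opendir(path);
    if (dir == NULL)
        return -1;

    long n = 0;
    struct dirent *ent;
    while ((ent = readdir(dir)) != NULL) {
        if (ent->d_type == DT_REG) {
            n++;                           /* fast path: no stat needed */
        } else if (ent->d_type == DT_UNKNOWN) {
            struct stat st;
            /* AT_SYMLINK_NOFOLLOW: classify the entry itself, like lstat */
            if (fstatat(dirfd(dir), ent->d_name, &st,
                        AT_SYMLINK_NOFOLLOW) == 0 && S_ISREG(st.st_mode))
                n++;
        }
    }

    closedir(dir);
    return n;
}
```

On a filesystem that always reports DT_UNKNOWN, the getdents program above would print 0, whereas this version pays for one stat per entry but still returns the right count.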

Answer 15

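I realized that when you have a huge amount of data, not processing it in memory is faster than "piping" the commands. So I saved the result to a file and analyzed it afterwards: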

ls -1 /path/to/dir > count.txt && cat count.txt | wc -l

Answer 16

You should use "getdents" instead of ls/find.

Here is a very good article that describes the getdents approach.

http://be-n.com/spw/you-can-list-a-million-files-in-a-directory-but-not-with-ls.html

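Here is an extract:

ls and practically every other method of listing a directory (including Python's os.listdir and find .) rely on libc readdir(). However, readdir() only reads 32K of directory entries at a time, which means that if you have a lot of files in the same directory (i.e. 500M of directory entries), reading all the directory entries is going to take an insanely long time, especially on a slow disk. For directories containing a large number of files, you need to dig deeper than tools that rely on readdir(). You need to use the getdents() system call directly, rather than the helper methods in libc.

We can find the C code to list the files using getdents() from here:

There are two modifications you need to make in order to quickly list all the files in a directory.

First, increase the buffer size from X to something like 5 megabytes.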

#define BUF_SIZE (1024*1024*5)

Then modify the main loop, where it prints out the information about each file in the directory, to skip entries with inode == 0. I did this by adding

if (dp->d_ino != 0) printf(...);

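In my case I also really only cared about the file names in the directory, so I also rewrote the printf() statement to print only the file name.

if(d->d_ino) printf("%s\n", (char *) d->d_name);

Compile it (it doesn't need any external libraries, so it's super simple to do)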

gcc listdir.c -o listdir

Now just run

./listdir [directory with insane number of files]
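Since the linked listing code is not reproduced here, the following is only a sketch of how the two modifications could be applied to counting instead of listing. It assumes Linux and the getdents64 variant of the call; count_names is a hypothetical name, and the struct layout follows the description in the getdents(2) man page.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#define BUF_SIZE (5 * 1024 * 1024)   /* modification 1: a 5 MB buffer */

/* Record layout returned by getdents64, as documented in getdents(2) */
struct linux_dirent64 {
    uint64_t       d_ino;
    int64_t        d_off;
    unsigned short d_reclen;
    unsigned char  d_type;
    char           d_name[];
};

/* Hypothetical helper: count the names in `path`, skipping "." and ".."
 * and entries whose inode number is 0. Returns -1 on error. */
long count_names(const char *path) {
    assert(path != NULL);

    int fd = open(path, O_RDONLY | O_DIRECTORY);
    if (fd == -1)
        return -1;

    char *buf = malloc(BUF_SIZE);    /* heap, not stack: 5 MB is large */
    if (buf == NULL) { close(fd); return -1; }

    long count = 0;
    for (;;) {
        long nread = syscall(SYS_getdents64, fd, buf, BUF_SIZE);
        if (nread == -1) { count = -1; break; }
        if (nread == 0)
            break;                   /* end of directory */

        for (long pos = 0; pos < nread; ) {
            struct linux_dirent64 *d = (struct linux_dirent64 *)(buf + pos);
            /* modification 2: ignore entries with inode == 0 */
            if (d->d_ino != 0 && strcmp(d->d_name, ".") != 0
                              && strcmp(d->d_name, "..") != 0)
                count++;
            pos += d->d_reclen;      /* records are variable-length */
        }
    }

    free(buf);
    close(fd);
    return count;
}
```

Whether the larger buffer actually helps will depend on the filesystem and disk; the timings quoted in this thread are the only measurements available here.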

Answer 17

The top 10 directories with the highest number of files.

dir=/ ; for i in $(ls -1 ${dir} | sort -n) ; { echo "$(find ${dir}${i} \
    -type f | wc -l) => $i,"; } | sort -nr | head -10

Answer 18

I prefer the following command to keep track of changes in the number of files in a directory.

watch -d -n 0.01 'ls | wc -l'

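The command keeps a window open that tracks the number of files in the directory, refreshing every 0.1 seconds (watch does not allow intervals shorter than 0.1 seconds, so a smaller value such as 0.01 is raised to that minimum).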
