Question

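I have a process that handles a large number of files (~96,000 files, ~12 TB of data). Several runs of the process have left the files scattered around the drive. Each iteration of the process uses several files, so collecting them causes a lot of thrashing around the disk.

Ideally, I would like the process to write the files it uses in order, so that the next run reads them sequentially (file sizes change between runs). Is there a way to hint at a physical ordering/grouping, short of writing to the raw partition?

Any other suggestions would be helpful.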

Thanks

Answer 1

You might look into two system calls, fadvise64 and fallocate, which tell the kernel how you intend to read or write a given file.
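As a minimal sketch of how these hints might be issued from C: glibc exposes them through the portable wrappers posix_fadvise and posix_fallocate. The helper names reserve_space and hint_sequential_read are hypothetical, made up for illustration. Note that preallocating space only gives the filesystem a chance to pick contiguous extents; it does not guarantee any particular on-disk placement.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: reserve `size` bytes for a file that is about to be
 * written, so the filesystem can allocate the space up front.
 * Returns 0 on success or an errno value on failure
 * (posix_fallocate reports errors via its return value, not errno). */
int reserve_space(int fd, off_t size) {
    assert(fd >= 0 && size > 0);
    return posix_fallocate(fd, 0, size);
}

/* Hypothetical helper: tell the kernel that the whole file (offset 0,
 * length 0 means "to end of file") will be read sequentially, which
 * typically enlarges readahead.
 * Returns 0 on success or an errno value on failure. */
int hint_sequential_read(int fd) {
    assert(fd >= 0);
    return posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
}
```

A writer would call reserve_space(fd, expected_size) right after creating each output file, and a reader would call hint_sequential_read(fd) right after opening each input file.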

Another pointer is the "Orlov block allocator" (Wikipedia, LWN), which affects how the kernel allocates new directories and file entries.

Answer 2

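In the end I decided not to worry about writing the files in any particular order. Instead, before starting a run, I find out where the first block of each file is located and then sort the file processing order by first-block location. It's not perfect, but it does make a big difference in processing time.

Here is the C code I use to get the first block of each file in a supplied list. I adapted it from sample code I found online (I can't seem to find the original source).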

#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <errno.h>

#include <linux/fs.h>

//
// Get the first block for each file passed to stdin,
// write filename & first block for each file to stdout.
// FIBMAP normally requires root (CAP_SYS_RAWIO).
//

int main(void) {
    int     fd;
    int     block;
    char    fname[512];

    while (fgets(fname, sizeof fname, stdin) != NULL) {

        // Strip the trailing newline, if there is one
        fname[strcspn(fname, "\n")] = '\0';

        // open() returns -1 on failure, so test for that explicitly
        fd = open(fname, O_RDONLY);
        if (fd < 0) {
            fprintf(stderr, "open failed for %s: %s\n", fname, strerror(errno));
            continue;
        }

        // In: logical block 0 of the file; out: its physical block number
        block = 0;
        if (ioctl(fd, FIBMAP, &block)) {
            fprintf(stderr, "FIBMAP ioctl failed for %s: %s\n", fname, strerror(errno));
        }
        printf("%010d, %s\n", block, fname);
        close(fd);
    }
    return 0;
}
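
The sorting step itself is not shown above. Because the block number is printed zero-padded, piping the output through sort(1) should be enough, but if the ordering has to happen inside the processing program, a minimal sketch might look like the following. The names file_entry, parse_entry and sort_by_first_block are hypothetical, and the parser assumes the "block, filename" line format printed by the program above.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* One line of the "block, filename" listing (hypothetical type). */
struct file_entry {
    long block;
    char name[512];
};

/* qsort comparator: ascending by first physical block. */
static int by_block(const void *a, const void *b) {
    const struct file_entry *x = a;
    const struct file_entry *y = b;
    return (x->block > y->block) - (x->block < y->block);
}

/* Parse one "0000012345, /path/to/file" line.
 * Returns 1 on success, 0 if the line does not match that format. */
int parse_entry(const char *line, struct file_entry *out) {
    assert(line != NULL && out != NULL);
    return sscanf(line, "%ld, %511[^\n]", &out->block, out->name) == 2;
}

/* Reorder the entries so files are visited in on-disk order
 * of their first block. */
void sort_by_first_block(struct file_entry *entries, size_t n) {
    assert(entries != NULL || n == 0);
    if (n > 1) {
        qsort(entries, n, sizeof *entries, by_block);
    }
}
```

The processing loop would then read the listing line by line with parse_entry, call sort_by_first_block once, and open the files in the resulting order.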

Ordering file locations on a Linux partition
