Question

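I wrote a function that merges two large files (file1, file2) into a new file (outputFile). Each file is in a line-based format, and entries are separated by a \0 byte. Both files contain the same number of null bytes.

An example file containing two entries might look like this: A\nB\n\0C\nZ\nB\n\0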

   Input:
   file1: A\nB\n\0C\nZ\nB\n\0
   file2: BBA\nAB\0T\nASDF\nQ\n\0
   Output
   outputFile: A\nB\nBBA\nAB\0C\nZ\nB\nT\nASDF\nQ\n\0

FILE * outputFile = fopen(...);
setvbuf ( outputFile  , NULL , _IOFBF , 1024*1024*1024 );
FILE * file1 = fopen(...); 
FILE * file2 = fopen(...); 
int c1, c2;
while((c1=fgetc(file1)) != EOF) {
    if(c1 == '\0'){
        while((c2=fgetc(file2)) != EOF && c2 != '\0') {
            fwrite(&c2, sizeof(char), 1, outputFile);
        }
        char nullByte = '\0';
        fwrite(&nullByte, sizeof(char), 1, outputFile);
    }else{
        fwrite(&c1, sizeof(char), 1, outputFile);
    }
}

Is there a way to improve the IO performance of this function? I increased the buffer size of outputFile to 1 GB using setvbuf. Would using posix_fadvise on file1 and file2 help?

Answer 1

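You are doing IO character by character. That is going to be needlessly and painfully S-L-O-W, even with buffered streams.

Take advantage of the fact that your data is stored in the files as NUL-terminated strings.

Assuming you're alternating NUL-terminated strings from each file, and that you're running on a POSIX platform, you can simply mmap() the input files: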

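#define _GNU_SOURCE   /* exposes O_DIRECT on Linux/glibc */
#include <fcntl.h>
#include <malloc.h>   /* memalign() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

typedef struct mapdata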
{
    const char *ptr;
    size_t bytes;
} mapdata_t;

mapdata_t mapFile( const char *filename )
{
    mapdata_t data;
    struct stat sb;

    int fd = open( filename, O_RDONLY );
    fstat( fd, &sb );

    data.bytes = sb.st_size;

    /* assumes we have a NUL byte after the file data 
       If the size of the file is an exact multiple of the
       page size, we won't have the terminating NUL byte! */
    data.ptr = mmap( NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0 );
    close( fd );
    return( data );
}

void unmapFile( mapdata_t data )
{
    /* munmap() takes a non-const void *, so cast away the const */
    munmap( ( void * ) data.ptr, data.bytes );
}

void mergeFiles( const char *file1, const char *file2, const char *output )
{
    char zeroByte = '\0';

    mapdata_t data1 = mapFile( file1 );
    mapdata_t data2 = mapFile( file2 );

    size_t strOffset1 = 0UL;
    size_t strOffset2 = 0UL;

    /* get a page-aligned buffer - a 64kB alignment should work */
    char *iobuffer = memalign( 64UL * 1024UL, 1024UL * 1024UL );

    /* memset the buffer to ensure the virtual mappings exist */
    memset( iobuffer, 0, 1024UL * 1024UL );

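    /* use of direct IO should reduce memory pressure - the 1 MB
       buffer is already pretty large, and since we're not seeking
       the page cache is really only slowing things down.
       Caveat: O_DIRECT usually requires the length of each write to
       be a multiple of the device block size, so the final partial
       flush done by fclose() can fail with EINVAL; drop O_DIRECT
       if that happens */
    int fd = open( output, O_RDWR | O_TRUNC | O_CREAT | O_DIRECT, 0644 );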

    FILE *outputfile = fdopen( fd, "wb" );
    setvbuf( outputfile, iobuffer, _IOFBF, 1024UL * 1024UL );

    /* loop until we reach the end of either mapped file */
    for ( ;; )
    {
        fputs( data1.ptr + strOffset1, outputfile );
        fwrite( &zeroByte, 1, 1, outputfile );

        fputs( data2.ptr + strOffset2, outputfile );
        fwrite( &zeroByte, 1, 1, outputfile );

        /* skip over the string, assuming there's one NUL
           byte in between strings */
        strOffset1 += 1 + strlen( data1.ptr + strOffset1 );
        strOffset2 += 1 + strlen( data2.ptr + strOffset2 );

        /* if either offset is too big, end the loop */
        if ( ( strOffset1 >= data1.bytes ) ||
             ( strOffset2 >= data2.bytes ) )
        {
            break;
        }
    }

    fclose( outputfile );

    /* fclose() does not free a buffer supplied through setvbuf() */
    free( iobuffer );

    unmapFile( data1 );
    unmapFile( data2 );       
}

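I have done no error checking at all. The code also depends on the proper header files being included.

Also note the assumption that the file data is NOT an exact multiple of the system page size, which ensures that a NUL byte is mapped after the file contents. If the size of the file is an exact multiple of the page size, you'll have to mmap() an additional page after the file contents to ensure there is a NUL byte to terminate the last string.

Or you can rely on a NUL byte being the last byte of the file's contents. If that ever turns out not to be true, you'll likely get either a SEGV or corrupted data.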
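One way to avoid depending on a NUL byte after the mapped data is to replace the strlen() calls with memchr() bounded by the number of bytes left in the mapping. The following is only a minimal sketch under that assumption: entryLength() and writeEntry() are hypothetical helper names that do not appear in the code above, and the caller is assumed to check that the offset is still smaller than the mapping size before each call.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: length of the entry starting at p, looking at
   no more than `remaining` bytes. Unlike strlen(), it never reads
   past the end of the mapping when the final NUL is missing. */
static size_t entryLength( const char *p, size_t remaining )
{
    const char *nul = memchr( p, '\0', remaining );
    return( nul ? ( size_t ) ( nul - p ) : remaining );
}

/* Hypothetical helper: write the entry at *offset (without its NUL)
   to out, then advance *offset past the entry and its terminator.
   Precondition (assumed, not checked): *offset < bytes. */
static void writeEntry( const char *data, size_t bytes,
                        size_t *offset, FILE *out )
{
    size_t len = entryLength( data + *offset, bytes - *offset );

    fwrite( data + *offset, 1, len, out );
    *offset += len + 1;
}
```

In the merge loop, each fputs()/strlen() pair would then become a single writeEntry() call, with the existing end-of-loop offset checks left in place.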

Answer 2

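You are using two function calls per character (one for input, one for output). Function calls are slow (they pollute the instruction pipeline).
fgetc() and fputc() have getc() / putc() counterparts, which are (or can be) implemented as macros. That lets the compiler inline the entire loop, except for the reading and writing of the buffers, which happens twice for every 512, 1024 or 4096 characters processed. (These will invoke system calls, but those are unavoidable anyway.)
Using read/write instead of buffered I/O will probably not be worth the effort; the extra bookkeeping will make your loop fatter. (By the way: using fwrite() to write a single character is certainly a waste, and the same goes for write().)
A larger output buffer might help, but I wouldn't count on it.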

Answer 3

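If you're writing single characters, you should use fputc instead of fwrite for a slight improvement.

Also, since you care about speed, you should try putc and getc instead of fputc and fgetc to see whether they run faster.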
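As a rough illustration of that suggestion, here is the loop from the question rewritten with getc() and putc(). It is only a sketch: mergeStreams is a hypothetical function name, and the three streams are assumed to be already open in binary mode.

```c
#include <stdio.h>

/* Same merge logic as the loop in the question, using getc()/putc()
   instead of fgetc()/fwrite(). mergeStreams is a hypothetical name. */
static void mergeStreams( FILE *file1, FILE *file2, FILE *outputFile )
{
    int c1, c2;

    while ( ( c1 = getc( file1 ) ) != EOF )
    {
        if ( c1 == '\0' )
        {
            /* copy the matching entry from file2, up to its NUL */
            while ( ( c2 = getc( file2 ) ) != EOF && c2 != '\0' )
            {
                putc( c2, outputFile );
            }
            putc( '\0', outputFile );
        }
        else
        {
            putc( c1, outputFile );
        }
    }
}
```

Whether this is actually faster depends on the C library and compiler, so it is worth timing both versions on the real files.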

Answer 4

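If you can use threads, create one for file1 and another for file2.

Make outputFile as large as it needs to be, then have thread1 write file1 into outputFile.

Meanwhile, thread2 seeks in outputFile to the length of file1 + 1 and writes file2 there.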

Edit

This is not a correct answer for this case, but I'll leave it here to prevent confusion.

I found more discussion about it here: improve performance in file IO in C

Improving IO performance when merging two files in C