Question

Millions of log files are generated every day, and I need to read all of them and combine them into a single file so that another application can process it.

I'm looking for the fastest way to do this. Currently I'm using threads, tasks, and Parallel, like this:

Parallel.For(0, files.Length, new ParallelOptions { MaxDegreeOfParallelism = 100 }, i =>
{
    ReadFiles(files[i]);
});

void ReadFiles(string file)
{
    try
    {
        var txt = File.ReadAllText(file);
        filesTxt.Add(txt);
    }
    catch { }
    GlobalCls.ThreadNo--;
}

Or

foreach (var file in files)
{
    //Int64 index = i;
    //var file = files[index];
    while (Process.GetCurrentProcess().Threads.Count > 100)
    { 
        Thread.Sleep(100);
        Application.DoEvents();
    }
    new Thread(() => ReadFiles(file)).Start();
    GlobalCls.ThreadNo++;
    // Task.Run(() => ReadFiles(file));      
}

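The problem is that after a few thousand files have been read, reading gets slower and slower!

Any idea why? What is the fastest way to read millions of small files? Thanks.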

Answer 1

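For I/O operations, CPU parallelism does not help. The I/O device (disk, network, etc.) is your bottleneck. Reading from the device concurrently may even degrade performance.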
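If the device really is the bottleneck, a plain sequential loop that streams each file into one output stream is a reasonable baseline to measure against. The following is only a sketch under assumptions: the names SequentialMerge and MergeSequentially are made up for illustration, the 1 MB buffer size is an arbitrary choice, and the files are assumed to share one encoding without a byte-order mark, since the bytes are copied verbatim.

```csharp
using System.Collections.Generic;
using System.IO;

public static class SequentialMerge
{
    // Hypothetical helper: copies every source file into one output file,
    // one file at a time, and returns how many files were merged.
    public static int MergeSequentially(IEnumerable<string> sourceFilePaths,
        string targetFilePath)
    {
        int merged = 0;
        // One output stream for the whole run; the 1 MB buffer is an assumption.
        using (var output = new FileStream(targetFilePath, FileMode.Create,
            FileAccess.Write, FileShare.None, bufferSize: 1 << 20))
        {
            foreach (var path in sourceFilePaths)
            {
                try
                {
                    using (var input = File.OpenRead(path))
                    {
                        input.CopyTo(output); // Streams bytes; no string is kept
                    }
                    merged++;
                }
                catch (IOException)
                {
                    // Skip files that raise an I/O error (for example, locked files)
                }
            }
        }
        return merged;
    }
}
```

Nothing accumulates in memory here, so throughput should stay roughly constant as the number of processed files grows.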

Answer 2

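Maybe you can simply concatenate the files with PowerShell, as shown in this answer.

Another alternative is to write a program that uses the FileSystemWatcher class to watch for new files and appends each one as it is created.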
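A minimal sketch of the FileSystemWatcher idea might look like the following. The class name LogAppender and the "*.log" filter are assumptions for illustration, and the target file is assumed to live outside the watched folder. Note that the Created event can fire before the writer has finished the file, and that the watcher's internal buffer may overflow under a very high rate of new files, so a real program would need retries and a fallback scan.

```csharp
using System;
using System.IO;

// Hypothetical helper: appends each newly created *.log file to one target file.
public sealed class LogAppender : IDisposable
{
    private readonly FileSystemWatcher _watcher;
    private readonly string _targetFilePath;
    private readonly object _gate = new object();

    public LogAppender(string sourceFolder, string targetFilePath)
    {
        _targetFilePath = targetFilePath;
        _watcher = new FileSystemWatcher(sourceFolder, "*.log");
        _watcher.Created += (sender, e) => Append(e.FullPath);
        _watcher.EnableRaisingEvents = true; // Start raising events
    }

    public void Append(string path)
    {
        try
        {
            var text = File.ReadAllText(path);
            lock (_gate) // Events may arrive on different thread-pool threads
            {
                File.AppendAllText(_targetFilePath, text);
            }
        }
        catch (IOException)
        {
            // The file may still be open by its writer; a real program
            // would queue the path and retry later.
        }
    }

    public void Dispose() => _watcher.Dispose();
}
```

Keeping an instance alive for the lifetime of the process (for example in a `using` block around the main loop) spreads the merge work across the day instead of leaving millions of files for one batch run.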

Answer 3

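It seems that you are loading the contents of all the files into memory before writing them back to a single file. This could explain why the process becomes slower over time.

One way to optimize the process is to separate the reading part from the writing part and run them in parallel. This is called the producer-consumer pattern. It can be implemented with the Parallel class, threads, or tasks, but I will demonstrate an implementation based on the powerful TPL Dataflow library, which is particularly well suited to jobs like this.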

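using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

private static async Task MergeFiles(IEnumerable<string> sourceFilePaths,
    string targetFilePath, CancellationToken cancellationToken = default,
    IProgress<int> progress = null)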
{
    var readerBlock = new TransformBlock<string, string>(async filePath =>
    {
        return File.ReadAllText(filePath); // Read the small file
        //return await File.ReadAllTextAsync(filePath); // .NET Core supports async
    }, new ExecutionDataflowBlockOptions()
    {
        MaxDegreeOfParallelism = 2, // Reading is parallelizable
        BoundedCapacity = 100, // No more than 100 file-paths buffered
        CancellationToken = cancellationToken, // Cancel at any time
    });

    StreamWriter streamWriter = null;

    int filesProcessed = 0;
    var writerBlock = new ActionBlock<string>(async text =>
    {
        await streamWriter.WriteAsync(text); // Append to the target file
        filesProcessed++;
        if (filesProcessed % 10 == 0) progress?.Report(filesProcessed);
    }, new ExecutionDataflowBlockOptions()
    {
        MaxDegreeOfParallelism = 1, // We can't parallelize the writer
        BoundedCapacity = 100, // No more than 100 file-contents buffered
        CancellationToken = cancellationToken, // Cancel at any time
    });

    readerBlock.LinkTo(writerBlock,
        new DataflowLinkOptions() { PropagateCompletion = true });

    // This is a tricky part. We use BoundedCapacity, so we must complete manually
    // the reader in case of a writer failure, otherwise a deadlock may occur.
    OnErrorComplete(writerBlock, readerBlock);

    // Open the output stream
    using (streamWriter = new StreamWriter(targetFilePath))
    {
        // Feed the reader with the file paths
        foreach (var filePath in sourceFilePaths)
        {
            var accepted = await readerBlock.SendAsync(filePath,
                cancellationToken); // Cancel at any time
            if (!accepted) break; // This will happen if the reader fails
        }
        readerBlock.Complete();
        await writerBlock.Completion;
    }

    async void OnErrorComplete(IDataflowBlock block1, IDataflowBlock block2)
    {
        await Task.WhenAny(block1.Completion); // Safe awaiting
        if (block1.Completion.IsFaulted) block2.Complete();
    }
}

Usage example:

var cts = new CancellationTokenSource();
var progress = new Progress<int>(value =>
{
    // Safe to update the UI
    Console.WriteLine($"Files processed: {value:#,0}");
});
var sourceFilePaths = Directory.EnumerateFiles(@"C:\SourceFolder", "*.log",
    SearchOption.AllDirectories); // Include subdirectories
await MergeFiles(sourceFilePaths, @"C:\AllLogs.log", cts.Token, progress);

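No thread is blocked while waiting on the pipeline: feeding the reader and writing the output are asynchronous, and so are the reads if you switch to File.ReadAllTextAsync.
BoundedCapacity is used to keep memory usage to a minimum.
If the disk drive is an SSD, you can try reading with a MaxDegreeOfParallelism greater than 2.
For best performance, write to a different disk drive than the one that contains the source files.
The TPL Dataflow library is available as a package for .NET Framework, and it is built into .NET Core.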
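For comparison, the same producer-consumer idea can be sketched without the Dataflow package, using a bounded BlockingCollection<string> from the base class library. This is not the answer's implementation, only an illustration: the names BoundedMerge and Merge are hypothetical, there is a single reader, the capacity of 100 simply mirrors the BoundedCapacity above, and both sides block their thread instead of running asynchronously.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public static class BoundedMerge
{
    // Hypothetical helper: one reader task feeds a bounded queue,
    // the calling thread drains it into the target file.
    public static int Merge(IEnumerable<string> sourceFilePaths, string targetFilePath)
    {
        using (var cts = new CancellationTokenSource())
        using (var queue = new BlockingCollection<string>(boundedCapacity: 100))
        {
            var producer = Task.Run(() =>
            {
                try
                {
                    foreach (var path in sourceFilePaths)
                        queue.Add(File.ReadAllText(path), cts.Token); // Blocks when full
                }
                finally
                {
                    queue.CompleteAdding(); // Always let the consumer finish
                }
            });

            int written = 0;
            try
            {
                using (var writer = new StreamWriter(targetFilePath))
                {
                    foreach (var text in queue.GetConsumingEnumerable())
                    {
                        writer.Write(text); // Append to the target file
                        written++;
                    }
                }
            }
            catch
            {
                cts.Cancel(); // Unblock the reader if the writer fails
                try { producer.Wait(); } catch (AggregateException) { }
                throw;
            }
            producer.Wait(); // Surfaces a read error as an AggregateException
            return written;
        }
    }
}
```

Because the queue is bounded, at most about 100 file contents are held in memory at any time, which is the property that keeps the merge from slowing down as it progresses.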

Reading millions of small files with C#
