Question

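I have a .NET program that runs through a directory containing tens of thousands of relatively small files (around 10 MB each), calculates their MD5 hashes, and stores that data in an SQLite database. The whole process works fine, but it takes a relatively long time (1094353 ms for roughly 60,000 files), and I am looking for ways to optimize it. Here are the solutions I have thought of:

Use additional threads and calculate the hashes of several files simultaneously. I am not sure how much I/O speed would limit me here.
Use a better hashing algorithm. I have looked around, and the one I am currently using seems to be the fastest (at least in C#).

Which would be the best approach, and are there any better ones?

Here is my current code: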

private async Task<string> CalculateHash(string file, System.Security.Cryptography.MD5 md5)
{
    Task<string> MD5 = Task.Run(() =>
    {
        using (var stream = new BufferedStream(System.IO.File.OpenRead(file), 1200000))
        {
            var hash = md5.ComputeHash(stream);
            var fileMD5 = string.Concat(Array.ConvertAll(hash, x => x.ToString("X2")));

            return fileMD5;
        }
    });

    return await MD5;
}

public async Task Main()
{
    using (var md5 = System.Security.Cryptography.MD5.Create())
    {
        foreach (var file in Directory.GetFiles(path))
        {
            var hash = await CalculateHash(file, md5);

            // Adds `hash` to the database
        }
    }
}
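One way to judge how much I/O limits the first idea above is to time the read and the hash separately for a sample of files. The following is only a minimal sketch under that assumption; the `HashTiming` class and `Measure` method are hypothetical names, not part of the original program.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Security.Cryptography;

public static class HashTiming
{
    // Returns how long it took to read the file into memory and,
    // separately, how long it took to hash the bytes once in memory.
    public static (TimeSpan Read, TimeSpan Hash) Measure(string file)
    {
        var readTimer = Stopwatch.StartNew();
        byte[] data = File.ReadAllBytes(file);
        readTimer.Stop();

        var hashTimer = Stopwatch.StartNew();
        using (var md5 = MD5.Create())
        {
            md5.ComputeHash(data);
        }
        hashTimer.Stop();

        return (readTimer.Elapsed, hashTimer.Elapsed);
    }
}
```

If the read time dominates, extra hashing threads are unlikely to help much; if the hash time dominates, parallel hashing should. Files that were read recently may be served from the operating system's cache, so timings from a second run can understate the real disk cost.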

Answer 1

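Create a pipeline of work. The easiest way I know to build a pipeline that combines parts of the code that must be single-threaded with parts that should be multi-threaded is TPL Dataflow: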

public static class Example
{ 
    private class Dto
    {
        public Dto(string filePath, byte[] data)
        {
            FilePath = filePath;
            Data = data;
        }

        public string FilePath { get; }
        public byte[] Data { get; }
    }

    public static async Task ProcessFiles(string path)
    {
        var getFilesBlock = new TransformBlock<string, Dto>(filePath => new Dto(filePath, File.ReadAllBytes(filePath))); //Only lets one thread do this at a time.

        var hashFilesBlock = new TransformBlock<Dto, Dto>(dto => HashFile(dto), 
                new ExecutionDataflowBlockOptions{MaxDegreeOfParallelism = Environment.ProcessorCount, //We can multi-thread this part.
                                                  BoundedCapacity = 50}); //Only allow 50 byte[]'s to be waiting in the queue. It will unblock getFilesBlock once there is room.

        var writeToDatabaseBlock = new ActionBlock<Dto>(WriteToDatabase,
              new ExecutionDataflowBlockOptions {BoundedCapacity = 50});//MaxDegreeOfParallelism defaults to 1 so we don't need to specify it.

        //Link the blocks together.
        getFilesBlock.LinkTo(hashFilesBlock, new DataflowLinkOptions {PropagateCompletion = true});
        hashFilesBlock.LinkTo(writeToDatabaseBlock, new DataflowLinkOptions {PropagateCompletion = true});

        //Queue the work for the first block.
        foreach (var filePath in Directory.EnumerateFiles(path))
        {
            await getFilesBlock.SendAsync(filePath).ConfigureAwait(false);
        }

        //Tell the first block we are done adding files.
        getFilesBlock.Complete();

        //Wait for the last block to finish processing its last item.
        await writeToDatabaseBlock.Completion.ConfigureAwait(false);
    }

    private static Dto HashFile(Dto dto)
    {
        using (var md5 = System.Security.Cryptography.MD5.Create())
        {
            return new Dto(dto.FilePath, md5.ComputeHash(dto.Data));
        }
    }

    private static async Task WriteToDatabase(Dto arg)
    {
        //Write to the database here.
    }
}

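This creates a pipeline with three segments.

The first is single-threaded; it reads the files from the hard drive into memory and stores them as byte[].

The second can use up to Environment.ProcessorCount threads to hash the files. It allows only 50 items in its inbound queue; when the first block tries to add an item beyond that, it stops processing new items until the next block is ready to accept them.

The third is single-threaded and adds the data to the database. It also allows only 50 items in its inbound queue at a time.

Because of the two limits of 50, there will be at most 100 byte[] in memory (50 in the hashFilesBlock queue and 50 in the writeToDatabaseBlock queue). Items that are currently being processed count toward the BoundedCapacity limit.

Update: for fun, I wrote a version that reports progress. It is untested and uses C# 7 features.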
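The update mentions a progress-reporting version. As a rough sketch of one way that could work (an assumption, not the answer author's code), a small thread-safe counter can forward the number of finished files to an `IProgress<int>`. The `ProgressCounter` class and its `ReportOne` method are hypothetical names introduced for illustration.

```csharp
using System;
using System.Threading;

public sealed class ProgressCounter
{
    private readonly IProgress<int> _progress;
    private int _completed;

    // `progress` may be null, in which case the counter only counts.
    public ProgressCounter(IProgress<int> progress)
    {
        _progress = progress;
    }

    // Number of files reported as finished so far.
    public int Completed => Volatile.Read(ref _completed);

    // Call once per finished file; safe to call from several threads.
    public void ReportOne()
    {
        int done = Interlocked.Increment(ref _completed);
        _progress?.Report(done);
    }
}
```

In the pipeline above, a call to `ReportOne()` would fit at the end of `WriteToDatabase`, since that is the last stage each file passes through. Passing a `Progress<int>` created on a UI thread would marshal the callbacks back to that thread.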

Answer 2

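As far as I understand, Task.Run will instantiate a new thread for every file you have there, which leads to a lot of threads and context switching between them. The case you describe sounds like a good fit for Parallel.For or Parallel.ForEach, something like this: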

public void CalcHashes(string path)
{
    string GetFileHash(System.Security.Cryptography.MD5 md5, string fileName)
    {
        using (var stream = new BufferedStream(System.IO.File.OpenRead(fileName), 1200000))
        {
            var hash = md5.ComputeHash(stream);
            var fileMD5 = string.Concat(Array.ConvertAll(hash, x => x.ToString("X2")));

            return fileMD5;
        }
    }

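    ParallelOptions options = new ParallelOptions();
    options.MaxDegreeOfParallelism = 8;

    // Hash every file in the directory, at most 8 at a time.
    Parallel.ForEach(Directory.EnumerateFiles(path), options, fileName =>
    {
        using (var md5 = System.Security.Cryptography.MD5.Create())
        {
            var fileMD5 = GetFileHash(md5, fileName);

            // Store `fileMD5` here, e.g. add it to the database.
        }
    });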
}

Edit: it seems Parallel.ForEach does not actually do the partitioning automatically. I added a maximum degree of parallelism of 8. Results: 107005 files, 46628 ms
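The snippet above computes each hash but does not keep it. A minimal sketch of collecting the results, assuming an in-memory dictionary is acceptable before writing to the database, could look like this. `ParallelHasher` and `HashDirectory` are hypothetical names, and the default limit of 8 simply mirrors the value used in the answer.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;
using System.Threading.Tasks;

public static class ParallelHasher
{
    // Hashes every file directly under `path` and returns file path -> hex MD5.
    public static IDictionary<string, string> HashDirectory(string path, int maxDegreeOfParallelism = 8)
    {
        // ConcurrentDictionary is safe to write to from several threads at once.
        var results = new ConcurrentDictionary<string, string>();
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxDegreeOfParallelism };

        Parallel.ForEach(Directory.EnumerateFiles(path), options, fileName =>
        {
            // One MD5 instance per file, because MD5 instances are not thread-safe.
            using (var md5 = MD5.Create())
            using (var stream = File.OpenRead(fileName))
            {
                byte[] hash = md5.ComputeHash(stream);
                results[fileName] = BitConverter.ToString(hash).Replace("-", "");
            }
        });

        return results;
    }
}
```

The returned dictionary can then be written to SQLite from a single thread, which keeps the database access out of the parallel loop.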

How can I optimize computing the hashes of thousands of files?
