I have a method that takes an image's filename, processes the image (CPU-intensive), and then uploads it to blob storage (async IO). Here is a summary of the method:
public async Task<ImageJob> ProcessImage(String fileName) {
    Byte[] imageBytes     = await ReadFileFromDisk( fileName ).ConfigureAwait(false);               // IO-bound
    Byte[] processedImage = RunFancyAlgorithm( imageBytes );                                        // CPU-bound
    Uri    blobUri        = await this.azureBlobClient.UploadBlob( processedImage ).ConfigureAwait(false); // IO-bound
    return new ImageJob( blobUri );
}
Another part of my program receives a list of thousands of filenames to process.
What is the most appropriate way to call the ProcessImage method so that the available IO and CPU capacity are used to the fullest?
I've identified seven different ways (so far) to call my method, but I'm not sure which one is best:
String[] fileNames = GetFileNames(); // typically contains thousands of filenames

// Approach 1:
{
    List<Task> tasks = fileNames
        .Select( fileName => ProcessImage( fileName ) )
        .ToList();

    await Task.WhenAll( tasks );
}

// Approach 2:
{
    List<Task> tasks = fileNames
        .Select( async fileName => await ProcessImage( fileName ) )
        .ToList();

    await Task.WhenAll( tasks );
}

// Approach 3:
{
    List<Task> tasks = new List<Task>();
    foreach( String fileName in fileNames )
    {
        Task imageTask = ProcessImage( fileName );
        tasks.Add( imageTask );
    }
    await Task.WhenAll( tasks );
}

// Approach 4 (Weirdly, this gives me this warning: CS4014 "Because this call is not awaited, execution of the current method continues before the call is completed. Consider applying the 'await' operator to the result of the call."
// ...even though I don't use an async lambda in the previous 3 examples, why is Parallel.ForEach so special?)
{
    ParallelLoopResult parallelResult = Parallel.ForEach( fileNames, fileName => ProcessImage( fileName ) );
}

// Approach 5:
{
    ParallelLoopResult parallelResult = Parallel.ForEach( fileNames, async fileName => await ProcessImage( fileName ) );
}

// Approach 6:
{
    List<Task> tasks = fileNames
        .AsParallel()
        .Select( fileName => ProcessImage( fileName ) )
        .ToList();

    await Task.WhenAll( tasks );
}

// Approach 7:
{
    List<Task> tasks = fileNames
        .AsParallel()
        .Select( async fileName => await ProcessImage( fileName ) )
        .ToList();

    await Task.WhenAll( tasks );
}
Answer (score: 3)
It sounds like you need to process many items in exactly the same way. As @StephenCleary mentioned, TPL Dataflow is a great fit for this kind of problem; a good introduction can be found here. The simplest approach is a main TransformBlock that runs ProcessImage, plus just a couple of other blocks. Here is a simple example to get you started:
public class ImageProcessor {
    private TransformBlock<string, ImageJob> imageProcessor;
    private ActionBlock<ImageJob> handleResults;

    public ImageProcessor() {
        var options = new ExecutionDataflowBlockOptions() {
            BoundedCapacity = 1000,
            MaxDegreeOfParallelism = Environment.ProcessorCount
        };

        imageProcessor = new TransformBlock<string, ImageJob>(fileName => ProcessImage(fileName), options);
        handleResults  = new ActionBlock<ImageJob>(job => HandleResults(job), options);

        imageProcessor.LinkTo(handleResults, new DataflowLinkOptions() { PropagateCompletion = true });
    }
    public async Task RunData() {
        var fileNames = GetFileNames();
        foreach (var fileName in fileNames) {
            // SendAsync respects BoundedCapacity, so the producer is throttled
            // instead of buffering thousands of filenames at once.
            await imageProcessor.SendAsync(fileName);
        }

        // All data has been passed into the pipeline.
        imageProcessor.Complete();

        // Completion propagates through the link, so awaiting the last block
        // guarantees every result has also been handled.
        await handleResults.Completion;
    }
    private async Task<ImageJob> ProcessImage(string fileName) {
        // Each of these steps could also be separated into discrete blocks (see the sketch below).
        var imageBytes     = await ReadFileFromDisk(fileName).ConfigureAwait(false);                     // IO-bound
        var processedImage = RunFancyAlgorithm(imageBytes);                                              // CPU-bound
        var blobUri        = await this.azureBlobClient.UploadBlob(processedImage).ConfigureAwait(false); // IO-bound
        return new ImageJob(blobUri);
    }

    private void HandleResults(ImageJob job) {
        // do something with the results
    }
}
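
As the comment in ProcessImage hints, you could also split the three stages into their own blocks so the CPU-bound step and the IO-bound steps get different degrees of parallelism. Below is a minimal sketch of what that might look like; it assumes a method added to the same ImageProcessor class (so ReadFileFromDisk, RunFancyAlgorithm, azureBlobClient, HandleResults and GetFileNames are available), and the RunSplitPipeline name and the parallelism/capacity numbers are illustrative assumptions, not tuned values.

// Sketch only: block names and parallelism settings are assumptions to illustrate the idea.
public async Task RunSplitPipeline() {
    var ioOptions  = new ExecutionDataflowBlockOptions { BoundedCapacity = 1000, MaxDegreeOfParallelism = 16 };
    var cpuOptions = new ExecutionDataflowBlockOptions { BoundedCapacity = 1000, MaxDegreeOfParallelism = Environment.ProcessorCount };

    // IO-bound: read each file from disk.
    var readBlock = new TransformBlock<string, byte[]>(fileName => ReadFileFromDisk(fileName), ioOptions);

    // CPU-bound: run the image-processing algorithm, limited to roughly one worker per core.
    var processBlock = new TransformBlock<byte[], byte[]>(imageBytes => RunFancyAlgorithm(imageBytes), cpuOptions);

    // IO-bound: upload to blob storage and wrap the result.
    var uploadBlock = new TransformBlock<byte[], ImageJob>(
        async processedImage => new ImageJob(await this.azureBlobClient.UploadBlob(processedImage).ConfigureAwait(false)),
        ioOptions);

    var resultsBlock = new ActionBlock<ImageJob>(job => HandleResults(job), cpuOptions);

    var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
    readBlock.LinkTo(processBlock, linkOptions);
    processBlock.LinkTo(uploadBlock, linkOptions);
    uploadBlock.LinkTo(resultsBlock, linkOptions);

    foreach (var fileName in GetFileNames())
        await readBlock.SendAsync(fileName);

    // Complete the head block; completion propagates down the chain,
    // so awaiting the tail block means every image has been read, processed, uploaded and handled.
    readBlock.Complete();
    await resultsBlock.Completion;
}

The point of the split is that each stage gets its own buffer and its own MaxDegreeOfParallelism, so slow uploads don't hold CPU workers hostage and vice versa; the single-TransformBlock version above is simpler and is usually a fine place to start.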