Read and write parallel tasks

Time: 2013-10-26 13:13:44

Tags: c# c#-4.0 task-parallel-library

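I'm looking for the best way to read from a data source such as Azure Table Storage (which is time-consuming), convert the data to JSON or CSV, and write it to local files whose names are based on the partition key.
One approach under consideration is to run the file-writing task at a fixed interval, triggered by a timer's Elapsed event.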

1 answer:

Answer 0 (score: 3)

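For work that does not parallelize well (such as I/O), the best approach is usually the producer-consumer model.

The way it works: a single thread handles the non-parallelizable task and reads all the work items into a buffer. A set of parallel tasks then reads from that buffer, processes the data, and puts the results into a second buffer. If the results also have to be written out in a non-parallelizable way, one more dedicated task writes them out.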

using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

public Stream ProcessData(string filePath)
{
    using(var sourceCollection = new BlockingCollection<string>())
    using(var destinationCollection = new BlockingCollection<SomeClass>())
    {
        //Create a new background task to start reading in the file
        Task.Factory.StartNew(() => ReadInFile(filePath, sourceCollection), TaskCreationOptions.LongRunning);

        //Create a new background task to process the read in lines as they come in
        Task.Factory.StartNew(() => TransformToClass(sourceCollection, destinationCollection), TaskCreationOptions.LongRunning);

        //Process the newly created objects as they are created on the same thread that we originally called the function with
        return TransformToStream(destinationCollection);
    }
}

private static void ReadInFile(string filePath, BlockingCollection<string> collection)
{
    try
    {
        foreach(var line in File.ReadLines(filePath))
        {
            collection.Add(line);
        }
    }
    finally
    {
        //This lets the consumer know that we will not be adding any more items to the collection.
        //Doing it in a finally block keeps the consumer from waiting forever if reading fails.
        collection.CompleteAdding();
    }
}

private static void TransformToClass(BlockingCollection<string> source, BlockingCollection<SomeClass> dest)
{
    try
    {
        //GetConsumingEnumerable() will take items out of the collection and block the thread if there are no items available and CompleteAdding() has not been called yet.
        Parallel.ForEach(source.GetConsumingEnumerable(),
                         (line) => dest.Add(SomeClass.ExpensiveTransform(line)));
    }
    finally
    {
        dest.CompleteAdding();
    }
}

private static Stream TransformToStream(BlockingCollection<SomeClass> source)
{
    var stream = new MemoryStream();
    foreach(var record in source.GetConsumingEnumerable())
    {
        record.Serialize(stream);
    }

    //Rewind so the caller reads from the beginning of the serialized data.
    stream.Position = 0;
    return stream;
}
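
To tie the pattern back to the question, here is a minimal sketch of the same three-stage pipeline where the final single-threaded stage writes each converted record to a file named after its partition key. Everything specific in it is an assumption made for illustration: the `Entity` type, the `PartitionedExport` class, the CSV line format, and the bounded capacity of 1000 are hypothetical, and the Azure Table Storage query is replaced by any `IEnumerable<Entity>`.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

// Hypothetical record type standing in for a table entity.
public sealed class Entity
{
    public string PartitionKey { get; set; }
    public string RowKey { get; set; }
    public string Value { get; set; }
}

public static class PartitionedExport
{
    // Reads entities from `source`, converts them to CSV lines in parallel,
    // and writes each line to "<outputDir>/<PartitionKey>.csv" from one writer thread.
    // Assumes partition keys are valid file names.
    public static void Run(IEnumerable<Entity> source, string outputDir)
    {
        Directory.CreateDirectory(outputDir);

        // Bounded buffers so a slow stage applies back-pressure to the stage before it.
        using (var entities = new BlockingCollection<Entity>(boundedCapacity: 1000))
        using (var lines = new BlockingCollection<KeyValuePair<string, string>>(boundedCapacity: 1000))
        {
            // Stage 1: a single producer pulls from the slow data source.
            var reader = Task.Factory.StartNew(() =>
            {
                try { foreach (var e in source) entities.Add(e); }
                finally { entities.CompleteAdding(); }
            }, TaskCreationOptions.LongRunning);

            // Stage 2: parallel conversion to (partition key, CSV line) pairs.
            var transformer = Task.Factory.StartNew(() =>
            {
                try
                {
                    Parallel.ForEach(entities.GetConsumingEnumerable(), e =>
                        lines.Add(new KeyValuePair<string, string>(
                            e.PartitionKey, e.RowKey + "," + e.Value)));
                }
                finally { lines.CompleteAdding(); }
            }, TaskCreationOptions.LongRunning);

            // Stage 3: a single consumer on the calling thread, one open writer per key.
            var writers = new Dictionary<string, StreamWriter>();
            try
            {
                foreach (var item in lines.GetConsumingEnumerable())
                {
                    StreamWriter writer;
                    if (!writers.TryGetValue(item.Key, out writer))
                    {
                        writer = new StreamWriter(Path.Combine(outputDir, item.Key + ".csv"));
                        writers.Add(item.Key, writer);
                    }
                    writer.WriteLine(item.Value);
                }
            }
            finally
            {
                foreach (var writer in writers.Values) writer.Dispose();
            }

            // Surface any exception thrown by the background stages.
            Task.WaitAll(reader, transformer);
        }
    }
}
```

Because the writer drains the buffer as items arrive, no timer is needed in this sketch. Note that the parallel stage does not preserve the order of rows within a partition.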

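I strongly recommend reading the free book Patterns for Parallel Programming, which covers this in depth. It has a whole section that explains the producer-consumer model in detail.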

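Update: for a small performance boost, use GetConsumingPartitioner() instead of GetConsumingEnumerable() in the Parallel.ForEach loop. GetConsumingPartitioner() is not part of the base class library; it is an extension method from the ParallelExtensionsExtras samples. Parallel.ForEach makes some assumptions about the IEnumerable it is given that cause it to take extra locks it does not need. If you pass a partitioner instead of an enumerable, it does not have to take those extra locks.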
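
As a rough illustration of what a GetConsumingPartitioner() extension could look like, here is a minimal sketch. It is not the library's actual source; the class names `BlockingCollectionExtensions` and `ConsumingPartitioner<T>` are hypothetical. The idea it assumes is that every partition simply draws from the collection's own consuming enumerable, which is already thread-safe, so Parallel.ForEach does not need to wrap a shared enumerator in its own locking.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class BlockingCollectionExtensions
{
    // Hypothetical helper: exposes a BlockingCollection<T> as a Partitioner<T>.
    public static Partitioner<T> GetConsumingPartitioner<T>(this BlockingCollection<T> collection)
    {
        return new ConsumingPartitioner<T>(collection);
    }

    private sealed class ConsumingPartitioner<T> : Partitioner<T>
    {
        private readonly BlockingCollection<T> _collection;

        public ConsumingPartitioner(BlockingCollection<T> collection)
        {
            if (collection == null) throw new ArgumentNullException("collection");
            _collection = collection;
        }

        // Parallel.ForEach requires dynamic partitions from a custom partitioner.
        public override bool SupportsDynamicPartitions { get { return true; } }

        public override IList<IEnumerator<T>> GetPartitions(int partitionCount)
        {
            if (partitionCount < 1) throw new ArgumentOutOfRangeException("partitionCount");
            var dynamicPartitions = GetDynamicPartitions();
            return Enumerable.Range(0, partitionCount)
                             .Select(_ => dynamicPartitions.GetEnumerator())
                             .ToArray();
        }

        // Each enumerator removes items from the collection as it goes and
        // blocks while the collection is empty but not yet marked complete.
        public override IEnumerable<T> GetDynamicPartitions()
        {
            return _collection.GetConsumingEnumerable();
        }
    }
}
```

With such a helper in scope, the loop in TransformToClass would become `Parallel.ForEach(source.GetConsumingPartitioner(), line => dest.Add(SomeClass.ExpensiveTransform(line)));`.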