I am trying to implement a data processing pipeline using TPL Dataflow. However, I am relatively new to dataflow and not completely sure how to use it properly for the problem I am trying to solve.

Problem:

I am trying to iterate through the list of files and process each file to read some data and then further process that data. Each file is roughly 700MB to 1GB in size. Each file contains JSON data. In order to process these files in parallel and not run out of memory, I am trying to use IEnumerable<> with yield return and then further process the data.

Once I get the list of files, I want to process a maximum of 4-5 files at a time in parallel. My confusion comes from:

1. How to use IEnumerable<> and yield return with async/await and dataflow. Came across this answer by svick, but still not sure how to convert IEnumerable<> to ISourceBlock and then link all the blocks together and track completion.
2. The producer will be really fast (going through the list of files), but the consumer will be very slow (processing each file - read data, deserialize JSON). In this case, how to track completion?
3. Should I use the LinkTo feature of datablocks to connect the various blocks, or use methods such as OutputAvailableAsync() and ReceiveAsync() to propagate data from one block to another?

Code:
private const int ProcessingSize = 4;
private BufferBlock<string> _fileBufferBlock;
private ActionBlock<string> _processingBlock;
private BufferBlock<DataType> _messageBufferBlock;
public Task ProduceAsync()
{
PrepareDataflow(token);
var bufferTask = ListFilesAsync(_fileBufferBlock, token);
var tasks = new List<Task> { bufferTask, _processingBlock.Completion };
return Task.WhenAll(tasks);
}
private async Task ListFilesAsync(ITargetBlock<string> targetBlock, CancellationToken token)
{
...
// Get list of file Uris
...
foreach(var fileNameUri in fileNameUris)
await targetBlock.SendAsync(fileNameUri, token);
targetBlock.Complete();
}
private async Task ProcessFileAsync(string fileNameUri, CancellationToken token)
{
var httpClient = new HttpClient();
try
{
using (var stream = await httpClient.GetStreamAsync(fileNameUri))
using (var sr = new StreamReader(stream))
using (var jsonTextReader = new JsonTextReader(sr))
{
while (jsonTextReader.Read())
{
if (jsonTextReader.TokenType == JsonToken.StartObject)
{
try
{
var data = _jsonSerializer.Deserialize<DataType>(jsonTextReader);
await _messageBufferBlock.SendAsync(data, token);
}
catch (Exception ex)
{
_logger.Error(ex, $"JSON deserialization failed - {fileNameUri}");
}
}
}
}
}
catch(Exception ex)
{
// Should throw?
// Or if converted to block then report using Fault() method?
}
finally
{
httpClient.Dispose();
buffer.Complete();
}
}
private void PrepareDataflow(CancellationToken token)
{
_fileBufferBlock = new BufferBlock<string>(new DataflowBlockOptions
{
CancellationToken = token
});
var actionExecuteOptions = new ExecutionDataflowBlockOptions
{
CancellationToken = token,
BoundedCapacity = ProcessingSize,
MaxMessagesPerTask = 1,
MaxDegreeOfParallelism = ProcessingSize
};
_processingBlock = new ActionBlock<string>(async fileName =>
{
try
{
await ProcessFileAsync(fileName, token);
}
catch (Exception ex)
{
_logger.Fatal(ex, $"Failed to process file: {fileName}, Error: {ex.Message}");
// Should fault the block?
}
}, actionExecuteOptions);
_fileBufferBlock.LinkTo(_processingBlock, new DataflowLinkOptions { PropagateCompletion = true });
_messageBufferBlock = new BufferBlock<DataType>(new ExecutionDataflowBlockOptions
{
CancellationToken = token,
BoundedCapacity = 50000
});
_messageBufferBlock.LinkTo(DataflowBlock.NullTarget<DataType>());
}
In the above code, I am not using IEnumerable<DataType> and yield return as I cannot use it with async/await. So I am linking the input buffer to an ActionBlock, which in turn posts to another queue. However, by using ActionBlock<>, I cannot link it to the next block for processing and have to manually Post/SendAsync from the ActionBlock<> to the BufferBlock<>. Also, in this case, I am not sure how to track completion.
This code works, but I am sure there could be a better solution than this where I can just link all the blocks (instead of creating a BufferBlock<DataType> and then sending messages from it to an ActionBlock<DataType>).

Another option could be to convert the IEnumerable<> to an IObservable<> using Rx, but again I am not much familiar with Rx and don't know exactly how to mix Rx and TPL Dataflow.
Answer 0 (score: 8)
Question 1

You can plug your IEnumerable<T> generator into your TPL Dataflow chain by using Post or SendAsync directly on the consumer block, like so:
foreach (string fileNameUri in fileNameUris)
{
await _processingBlock.SendAsync(fileNameUri).ConfigureAwait(false);
}
You could also use a BufferBlock<TInput>, but in your case it really doesn't seem necessary (or even helpful - see the next part).
Question 2

When would you want to use SendAsync rather than Post? If your producer runs faster than the URIs can be processed (and you have indicated that this is the case), and you choose to give your _processingBlock a BoundedCapacity, then when the block's internal buffer reaches the specified capacity, your SendAsync will "hang" until a buffer slot frees up, and your foreach loop will be throttled. This feedback mechanism creates back pressure and ensures that you don't run out of memory.
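To make that concrete, here is a minimal, self-contained sketch of a bounded block being fed with SendAsync (fileNameUris is assumed to be the list of URIs from your code; the delay just stands in for slow per-file work):

var block = new ActionBlock<string>(
    async uri =>
    {
        await Task.Delay(1000); // simulate slow per-file processing
        Console.WriteLine($"Processed {uri}");
    },
    new ExecutionDataflowBlockOptions
    {
        BoundedCapacity = 4,          // at most 4 items buffered or in flight
        MaxDegreeOfParallelism = 4
    });

foreach (var uri in fileNameUris)
{
    // Completes immediately while there is spare capacity; otherwise the await
    // does not finish until a slot frees up, throttling this loop.
    await block.SendAsync(uri);
}
block.Complete();
await block.Completion;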
Question 3

You should definitely use the LinkTo method to link your blocks in most cases. Unfortunately, yours is a corner case due to the interplay of IDisposable and (potentially) very large sequences. So your completion will flow automatically between the buffer and the processing block (due to LinkTo), but after that you will need to propagate it manually. This is tricky, but doable.

I will illustrate this with a "Hello World" example where the producer iterates over the characters of a string and the consumer (which is really slow) outputs each character to the Debug window.

Note: LinkTo is intentionally absent here.
// REALLY slow consumer.
var consumer = new ActionBlock<char>(async c =>
{
await Task.Delay(100);
Debug.Print(c.ToString());
}, new ExecutionDataflowBlockOptions { BoundedCapacity = 1 });
var producer = new ActionBlock<string>(async s =>
{
foreach (char c in s)
{
await consumer.SendAsync(c);
Debug.Print($"Yielded {c}");
}
});
try
{
producer.Post("Hello world");
producer.Complete();
await producer.Completion;
}
finally
{
consumer.Complete();
}
// Observe combined producer and consumer completion/exceptions/cancellation.
await Task.WhenAll(producer.Completion, consumer.Completion);
Output:
Yielded H
H
Yielded e
e
Yielded l
l
Yielded l
l
Yielded o
o
Yielded
 
Yielded w
w
Yielded o
o
Yielded r
r
Yielded l
l
Yielded d
d
As you can see from the output above, the producer gets throttled and the hand-off buffer between the blocks never grows too large.

EDIT

You may find it cleaner to propagate completion via

producer.Completion.ContinueWith(
_ => consumer.Complete(), TaskContinuationOptions.ExecuteSynchronously
);
... right after the producer definition. This allows you to slightly reduce producer/consumer coupling - but at the end you still have to remember to observe Task.WhenAll(producer.Completion, consumer.Completion).
Answer 1 (score: 7)
"In order to process these files in parallel and not run out of memory, I am trying to use IEnumerable<> with yield return and then further process the data."
I don't believe this step is necessary. All you're actually avoiding here is a list of filenames. Even if you had millions of files, the list of filenames is just not going to take up a significant amount of memory.
"So I am linking the input buffer to an ActionBlock, which in turn posts to another queue. However, by using ActionBlock<>, I cannot link it to the next block for processing and have to manually Post/SendAsync from the ActionBlock<> to the BufferBlock<>. Also, in this case, not sure how to track completion."
ActionBlock<TInput> is an "end of the line" block. It only accepts input and does not produce any output. In your case, you don't want ActionBlock<TInput>; you want TransformManyBlock<TInput, TOutput>, which takes input, runs a function over it, and produces output (with any number of output items per input item).
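As a toy illustration of that shape (unrelated to the file-processing code itself), one input item can fan out into any number of output items, and the outputs can be linked onward:

// One string in, many characters out; each input may yield zero or more outputs.
var splitter = new TransformManyBlock<string, char>(s => s.ToCharArray());
var printer = new ActionBlock<char>(c => Console.WriteLine(c));
splitter.LinkTo(printer, new DataflowLinkOptions { PropagateCompletion = true });

splitter.Post("Hello");
splitter.Complete();
await printer.Completion;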
Another point to keep in mind is that every dataflow block already has an input buffer, so the extra BufferBlock is unnecessary.
Finally, if you're already in "dataflow land", it's usually best to end with a dataflow block that actually does something (e.g., an ActionBlock instead of a BufferBlock). In this case, you could use the BufferBlock as a bounded producer/consumer queue, where some other code is consuming the results (a sketch of such a consumer loop follows the code below). Personally, I think it would probably be cleaner to rewrite the consuming code as the action of an ActionBlock, but it may also be cleaner to keep your consumer independent of the dataflow. For the code below, I left the final bounded BufferBlock in place, but if you use this solution, consider changing that final block to a bounded ActionBlock instead.
private const int ProcessingSize = 4;
private static readonly HttpClient HttpClient = new HttpClient();
private TransformManyBlock<string, DataType> _processingBlock;
private BufferBlock<DataType> _messageBufferBlock;
public Task ProduceAsync()
{
PrepareDataflow(token);
ListFiles(_processingBlock, token);
_processingBlock.Complete();
return _processingBlock.Completion;
}
private void ListFiles(ITargetBlock<string> targetBlock, CancellationToken token)
{
... // Get list of file Uris, occasionally calling token.ThrowIfCancellationRequested()
foreach(var fileNameUri in fileNameUris)
targetBlock.Post(fileNameUri);
}
private async Task<IEnumerable<DataType>> ProcessFileAsync(string fileNameUri, CancellationToken token)
{
return Process(await HttpClient.GetStreamAsync(fileNameUri), fileNameUri, token);
}
private IEnumerable<DataType> Process(Stream stream, string fileNameUri, CancellationToken token)
{
using (stream)
using (var sr = new StreamReader(stream))
using (var jsonTextReader = new JsonTextReader(sr))
{
while (jsonTextReader.Read())
{
token.ThrowIfCancellationRequested();
if (jsonTextReader.TokenType == JsonToken.StartObject)
{
// yield return is not allowed inside a try block with a catch clause, so deserialize first.
var ok = false;
var data = default(DataType);
try
{
data = _jsonSerializer.Deserialize<DataType>(jsonTextReader);
ok = true;
}
catch (Exception ex)
{
_logger.Error(ex, $"JSON deserialization failed - {fileNameUri}");
}
if (ok)
yield return data;
}
}
}
}
private void PrepareDataflow(CancellationToken token)
{
var executeOptions = new ExecutionDataflowBlockOptions
{
CancellationToken = token,
MaxDegreeOfParallelism = ProcessingSize
};
_processingBlock = new TransformManyBlock<string, DataType>(fileName =>
ProcessFileAsync(fileName, token), executeOptions);
_messageBufferBlock = new BufferBlock<DataType>(new DataflowBlockOptions
{
CancellationToken = token,
BoundedCapacity = 50000
});
// Link the transform's output into the bounded buffer and flow completion downstream.
_processingBlock.LinkTo(_messageBufferBlock, new DataflowLinkOptions { PropagateCompletion = true });
}
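As an aside, here is a rough sketch of how the "other code" mentioned above could drain the bounded _messageBufferBlock from outside the dataflow mesh (HandleData is a hypothetical placeholder for whatever you do with each item):

// Single-consumer loop: wait until output is available (or the block completes),
// then receive and process items one at a time.
while (await _messageBufferBlock.OutputAvailableAsync())
{
    var data = await _messageBufferBlock.ReceiveAsync();
    HandleData(data); // hypothetical downstream processing
}
await _messageBufferBlock.Completion; // observe completion (and any faults)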
Alternatively, you could use Rx. Learning Rx can be quite difficult, though, especially for mixed asynchronous and parallel dataflow situations, which is what you have here.

As for your other questions:
"How to use IEnumerable<> and yield return with async/await and dataflow."
async and yield are completely incompatible, at least in today's language. In your case, the JSON readers have to read the stream synchronously (they don't support asynchronous reading), so the actual stream processing is synchronous and can be used with yield. The initial back-and-forth to get the stream itself can still be asynchronous and can use async. This is as good as we can get today, until the JSON readers support asynchronous reading and the language supports async yield. (Rx could do an "asynchronous yield" today, but the JSON readers still don't support asynchronous reading, so it won't help in this particular case.)
"In this case, how to track completion."
If the JSON readers did support asynchronous reading, then the solution above would not be the best one. In that case, you would want to use manual SendAsync calls and only link the completion of the blocks, which can be done like this:
_processingBlock.Completion.ContinueWith(
task =>
{
if (task.IsFaulted)
((IDataflowBlock)_messageBufferBlock).Fault(task.Exception);
else if (!task.IsCanceled)
_messageBufferBlock.Complete();
},
CancellationToken.None,
TaskContinuationOptions.DenyChildAttach | TaskContinuationOptions.ExecuteSynchronously,
TaskScheduler.Default);
"Should I use the LinkTo feature of datablocks to connect the various blocks? Or use methods such as OutputAvailableAsync() and ReceiveAsync() to propagate data from one block to another?"
Use LinkTo whenever you can. It handles all the corner cases for you.
"// Should throw? // Should fault the block?"
That's entirely up to you. By default, when the processing of any item fails, the block faults, and if you are propagating completion, the entire chain of blocks will fault.

Faulting a block is rather drastic; it throws away any work in progress and refuses to continue processing. You have to build a new dataflow mesh if you want to retry.

If you prefer a "softer" error strategy, you can either catch the exceptions and do something with them like logging (which the code currently does), or you can change the nature of your dataflow block to pass the exceptions along as data items.
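For that last option, a minimal sketch (the ProcessingResult envelope and ParseAsync are invented here just to show the shape):

// Failures travel through the mesh as ordinary data items, so the block never faults.
public class ProcessingResult
{
    public DataType Data { get; set; }
    public Exception Error { get; set; }
}

var parse = new TransformBlock<string, ProcessingResult>(async uri =>
{
    try
    {
        return new ProcessingResult { Data = await ParseAsync(uri) }; // hypothetical helper
    }
    catch (Exception ex)
    {
        return new ProcessingResult { Error = ex }; // logged or handled downstream
    }
});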
Answer 2 (score: 2)
It would be worth looking at Rx. Unless I'm missing something, it seems like all the code you need (other than your existing ProcessFileAsync method) would be this:

var query =
fileNameUris
.Select(fileNameUri =>
Observable
.FromAsync(ct => ProcessFileAsync(fileNameUri, ct)))
.Merge(maxConcurrent: 4);
var subscription =
query
.Subscribe(
u => { },
() => { Console.WriteLine("Done."); });
This processes all of the files and writes "Done." when complete. It runs asynchronously. It can be cancelled by calling subscription.Dispose(). You can specify the maximum degree of parallelism.
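If you would rather await completion than subscribe (for example from an async method), something like this should also work - a sketch, assuming Rx.NET with the ToTask bridge from System.Reactive.Threading.Tasks:

// LastOrDefaultAsync avoids an exception when the sequence is empty;
// ToTask bridges the observable to a Task you can await.
await query.LastOrDefaultAsync().ToTask();
Console.WriteLine("All files processed.");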