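I have a computation that I am parallelizing with PLINQ, as follows:
The IEnumerable<T> source provides objects read from a file.
I need to run a heavyweight computation, HeavyComputation, on each T, and I want that work farmed out across threads, so I use PLINQ like this: source.AsParallel().Select(HeavyComputation)
Here is where it gets interesting: because of constraints on the type of file reader that provides source, I need source to be enumerated on the initial thread, not on the parallel workers. The full evaluation of source must be bound to the main thread. However, it appears that source is actually enumerated on the worker threads.
My question is: is there a straightforward way to modify this code so that the enumeration of source is bound to the initial thread, while the heavy work is farmed out to the parallel workers? Keep in mind that doing an eager .ToList() before the AsParallel() is not an option, because the data stream coming from the file is massive.
Here is some example code that demonstrates the problem as I see it: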
using System.Threading;
using System.Collections.Generic;
using System.Linq;
using System;

public class PlinqTest
{
    private static string FormatItems<T>(IEnumerable<T> source)
    {
        return String.Format("[{0}]", String.Join(";", source));
    }

    public static void Main()
    {
        var expectedThreadIds = new[] { Thread.CurrentThread.ManagedThreadId };

        var threadIds = Enumerable.Range(1, 1000)
            .Select(x => Thread.CurrentThread.ManagedThreadId) // (1)
            .AsParallel()
            .WithDegreeOfParallelism(8)
            .WithExecutionMode(ParallelExecutionMode.ForceParallelism)
            .AsOrdered()
            .Select(x => x) // (2)
            .ToArray();

        // In the computation above, the lambda in (1) is a
        // stand-in for the file-reading operation that we
        // want to be bound to the main thread, while the
        // lambda in (2) is a stand-in for the "expensive
        // computation" that we want to be farmed out to the
        // parallel worker threads. In fact, (1) is being
        // executed on all threads, as can be seen from the
        // output.

        Console.WriteLine("Expected thread IDs: {0}",
            FormatItems(expectedThreadIds));
        Console.WriteLine("Found thread IDs: {0}",
            FormatItems(threadIds.Distinct()));
    }
}
Example output I get is:
Expected thread IDs: [1]
Found thread IDs: [7;4;8;6;11;5;10;9]
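Answer 0 (score: 1)
If you drop PLINQ and use the Task Parallel Library explicitly, this is fairly straightforward (although perhaps a little less concise):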
// Limits the parallelism of the "expensive task"
var semaphore = new SemaphoreSlim(8);
var tasks = Enumerable.Range(1, 1000)
    .Select(x => Thread.CurrentThread.ManagedThreadId)
    .Select(async x =>
    {
        await semaphore.WaitAsync();
        try
        {
            // Pair the source thread ID (x) with the worker thread ID
            return await Task.Run(() => Tuple.Create(x, Thread.CurrentThread.ManagedThreadId));
        }
        finally
        {
            // Release the slot even if the task throws
            semaphore.Release();
        }
    });
return Task.WhenAll(tasks).Result;
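Note that I am using Tuple.Create to record both the thread ID coming from the main thread and the thread ID coming from the spawned task. In my testing, the former is always the same for every tuple, while the latter varies, which is as it should be.
The semaphore makes sure that the degree of parallelism never exceeds 8 (although with a task as cheap as creating a tuple, that is unlikely to happen anyway). Once 8 is reached, any new task waits until a slot becomes available on the semaphore.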
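One thing worth checking against the question's memory constraint: Task.WhenAll enumerates the whole task sequence on the calling thread before it returns, so every source item is pulled up front and the semaphore only throttles the Task.Run part. The following is a minimal sketch of that check, not part of the original answer; the class and method names (EagerEnumerationCheck, CountPulledBeforeWaiting) are hypothetical, and a semaphore of 2 and x * 2 are arbitrary stand-ins.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class EagerEnumerationCheck
{
    // Returns how many source items had been pulled by the time
    // Task.WhenAll returned, i.e. before waiting for any result.
    public static int CountPulledBeforeWaiting(int itemCount, out int[] results)
    {
        int pulled = 0; // only modified on the calling thread, during enumeration
        var semaphore = new SemaphoreSlim(2);

        var tasks = Enumerable.Range(1, itemCount)
            .Select(x => { pulled++; return x; }) // stand-in for the file read
            .Select(async x =>
            {
                await semaphore.WaitAsync();
                try { return await Task.Run(() => x * 2); } // stand-in for the heavy work
                finally { semaphore.Release(); }
            });

        // WhenAll walks the entire lazy query here, on the calling thread.
        Task<int[]> all = Task.WhenAll(tasks);
        int pulledAtStart = pulled;

        results = all.Result; // results come back in source order
        return pulledAtStart;
    }

    public static void Main()
    {
        int pulled = CountPulledBeforeWaiting(100, out int[] results);
        Console.WriteLine("Pulled before waiting: {0}", pulled);
        Console.WriteLine("First result: {0}", results[0]);
    }
}
```

If the source really is too large to hold in memory, this suggests the enumeration itself would also need to be throttled, not just the worker tasks.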
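Answer 1 (score: 0)
You could use the OffloadQueryEnumeration method below, which ensures that the enumeration of the source sequence happens on the same thread that enumerates the resulting IEnumerable<TResult>. The querySelector is a delegate that converts a proxy of the source sequence to a ParallelQuery<T>. This query is enumerated internally on a ThreadPool thread, but the output values are surfaced back on the current thread.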
// Requires: using System.Runtime.ExceptionServices; using System.Threading;
// using System.Threading.Tasks; and must be declared inside a static class.

/// <summary>Enumerates the source sequence on the current thread, and enumerates
/// the projected query on a ThreadPool thread.</summary>
public static IEnumerable<TResult> OffloadQueryEnumeration<TSource, TResult>(
    this IEnumerable<TSource> source,
    Func<IEnumerable<TSource>, IEnumerable<TResult>> querySelector)
{
    // Arguments validation omitted
    var locker = new object();
    (TSource Value, bool HasValue) input = default; bool inputCompleted = false;
    (TResult Value, bool HasValue) output = default; bool outputCompleted = false;
    using var sourceEnumerator = source.GetEnumerator();

    // Proxy sequence consumed by the query; it receives items through the input slot
    IEnumerable<TSource> GetSourceProxy()
    {
        while (true)
        {
            TSource item;
            lock (locker)
            {
                if (!input.HasValue)
                {
                    if (inputCompleted) yield break;
                    Monitor.Wait(locker); continue;
                }
                item = input.Value; input = default;
                Monitor.PulseAll(locker);
            }
            yield return item;
        }
    }

    var query = querySelector(GetSourceProxy());

    // Enumerate the query on a ThreadPool thread, publishing results through the output slot
    var task = Task.Run(() =>
    {
        try
        {
            foreach (var result in query)
            {
                lock (locker)
                {
                    while (output.HasValue) Monitor.Wait(locker);
                    output = (result, true);
                    Monitor.PulseAll(locker);
                }
            }
        }
        finally
        {
            lock (locker) { outputCompleted = true; Monitor.PulseAll(locker); }
        }
    });

    // Current thread: yield available results, otherwise feed the next source item
    Exception sourceEnumeratorException = null;
    while (true)
    {
        TResult result;
        lock (locker)
        {
            if (output.HasValue)
            {
                result = output.Value; output = default;
                Monitor.PulseAll(locker);
                goto yieldResult;
            }
            if (outputCompleted) break;
            if (input.HasValue || inputCompleted)
            {
                Monitor.Wait(locker); continue;
            }
            try
            {
                if (sourceEnumerator.MoveNext())
                    input = (sourceEnumerator.Current, true);
                else
                    inputCompleted = true;
            }
            catch (Exception ex)
            {
                sourceEnumeratorException = ex;
                inputCompleted = true;
            }
            Monitor.PulseAll(locker); continue;
        }
    yieldResult:
        yield return result;
    }

    task.GetAwaiter().GetResult(); // Propagate possible exceptions
    lock (locker) if (sourceEnumeratorException != null)
        ExceptionDispatchInfo.Capture(sourceEnumeratorException).Throw();
}
This method uses the Monitor.Wait/Monitor.Pulse mechanism to synchronize the transfer of values from one thread to another.
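To illustrate the handoff pattern in isolation, here is a minimal sketch of a single-slot transfer between two threads using Monitor.Wait and Monitor.PulseAll. It is a simplified stand-in and not the method above; the names HandoffDemo and SumViaHandoff are hypothetical. The producer waits while the slot is full, the consumer waits while it is empty, and each side pulses after changing the slot.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class HandoffDemo
{
    // Sends 1..count through a single shared slot to a ThreadPool consumer
    // and returns the sum computed by the consumer.
    public static int SumViaHandoff(int count)
    {
        var locker = new object();
        int slot = 0; bool hasValue = false; bool completed = false;

        var consumer = Task.Run(() =>
        {
            int sum = 0;
            while (true)
            {
                lock (locker)
                {
                    // Sleep until there is a value or the producer is done
                    while (!hasValue && !completed) Monitor.Wait(locker);
                    if (!hasValue) break; // completed and slot drained
                    sum += slot; hasValue = false;
                    Monitor.PulseAll(locker); // wake the producer
                }
            }
            return sum;
        });

        for (int i = 1; i <= count; i++)
        {
            lock (locker)
            {
                // Sleep until the consumer has emptied the slot
                while (hasValue) Monitor.Wait(locker);
                slot = i; hasValue = true;
                Monitor.PulseAll(locker); // wake the consumer
            }
        }
        lock (locker) { completed = true; Monitor.PulseAll(locker); }

        return consumer.Result;
    }

    public static void Main()
    {
        Console.WriteLine(SumViaHandoff(5)); // prints 15
    }
}
```

Monitor.Wait releases the lock while the thread sleeps and reacquires it before returning, which is why both sides re-check their condition in a loop.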
Usage example:
int[] threadIds = Enumerable
    .Range(1, 1000)
    .Select(x => Thread.CurrentThread.ManagedThreadId)
    .OffloadQueryEnumeration(proxy => proxy
        .AsParallel()
        .WithDegreeOfParallelism(8)
        .WithExecutionMode(ParallelExecutionMode.ForceParallelism)
        .AsOrdered()
        .Select(x => x)
    )
    .ToArray();