I was trying to compare single-threaded and parallel performance in C# when I stumbled upon this peculiar case:
Code #1 (single-threaded only)
static void Main(string[] args)
{
    var iterations = 1000000000;
    var sum = 0;
    var stp = new Stopwatch();
    stp.Start();
    for (int i = 0; i < iterations; i++)
    {
        sum++;
    }
    stp.Stop();
    Console.WriteLine("Single Thread");
    Console.WriteLine($"Sum: {sum}");
    Console.WriteLine($"Time Taken (ms): {stp.ElapsedMilliseconds}");
}
Result
Single Thread
Sum: 1000000000
Time Taken (ms): 351
Code #2 (single-threaded and parallel together)
static void Main(string[] args)
{
    var iterations = 1000000000;
    var sum = 0;
    var stp = new Stopwatch();
    stp.Start();
    for (int i = 0; i < iterations; i++)
    {
        sum++;
    }
    stp.Stop();
    Console.WriteLine("Single Thread");
    Console.WriteLine($"Sum: {sum}");
    Console.WriteLine($"Time Taken (ms): {stp.ElapsedMilliseconds}");
    sum = 0;
    stp.Reset();
    stp.Start();
    Parallel.For(0, iterations, i =>
    {
        sum++;
    });
    stp.Stop();
    Console.WriteLine("Parallel");
    Console.WriteLine($"Sum: {sum}");
    Console.WriteLine($"Time Taken (ms): {stp.ElapsedMilliseconds}");
}
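Result
Single Thread
Sum: 1000000000
Time Taken (ms): 1865
Parallel
Sum: 275202313
Time Taken (ms): 5831
Why does the performance of the single-threaded part change so drastically once the parallel part is added?
The difference:
Code #1: Single Thread, Sum: 1000000000, Time Taken (ms): 351
Code #2: Single Thread, Sum: 1000000000, Time Taken (ms): 1865
351 ms vs. 1865 ms for the same piece of code?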
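Answer 0 (score: 3)
The IL generated for each version is different. First, let's look at the first example (no parallel part in the program, shown only up to the Stopwatch.Stop() call):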
.method private hidebysig static void Main(string[] args) cil managed
{
  .entrypoint
  // Code size 121 (0x79)
  .maxstack 2
  .locals init ([0] int32 iterations,
           [1] int32 sum,
           [2] class [System]System.Diagnostics.Stopwatch stp,
           [3] int32 i,
           [4] bool V_4)
  IL_0000: nop
  IL_0001: ldc.i4 0x3b9aca00 // Loads 1000000000
  IL_0006: stloc.0 // Store it in local variable 0 (iterations)
  IL_0007: ldc.i4.0 // Push 0 onto the stack as int32
  IL_0008: stloc.1 // Pop the value from the stack into local variable 1 (sum)
  IL_0009: newobj instance void [System]System.Diagnostics.Stopwatch::.ctor()
  IL_000e: stloc.2
  IL_000f: ldloc.2
  IL_0010: callvirt instance void [System]System.Diagnostics.Stopwatch::Start()
  IL_0015: nop
  IL_0016: ldc.i4.0
  IL_0017: stloc.3
  IL_0018: br.s IL_0024
  IL_001a: nop // Start of for loop
  IL_001b: ldloc.1
  IL_001c: ldc.i4.1
  IL_001d: add
  IL_001e: stloc.1
  IL_001f: nop
  IL_0020: ldloc.3
  IL_0021: ldc.i4.1
  IL_0022: add
  IL_0023: stloc.3
  IL_0024: ldloc.3
  IL_0025: ldloc.0
  IL_0026: clt
  IL_0028: stloc.s V_4
  IL_002a: ldloc.s V_4
  IL_002c: brtrue.s IL_001a // If true, branch back to the start of the loop
  IL_002e: ldloc.2
  IL_002f: callvirt instance void [System]System.Diagnostics.Stopwatch::Stop()
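This is fairly straightforward. I added a few comments, but there is not much to it. Let's compare it with the parallel version (again, only the for loop up to the point where the stopwatch stops):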
.method private hidebysig static void Main(string[] args) cil managed
{
  .entrypoint
  // Code size 257 (0x101)
  .maxstack 4
  .locals init ([0] class Test.Program/'c__DisplayClass0_0' 'CS$8__locals0',
           [1] int32 iterations,
           [2] class [System]System.Diagnostics.Stopwatch stp,
           [3] int32 i,
           [4] int32 V_4,
           [5] bool V_5)
  IL_0000: newobj instance void Test.Program/'c__DisplayClass0_0'::.ctor()
  IL_0005: stloc.0
  IL_0006: nop
  IL_0007: ldc.i4 0x3b9aca00
  IL_000c: stloc.1
  IL_000d: ldloc.0
  IL_000e: ldc.i4.0
  IL_000f: stfld int32 Test.Program/'c__DisplayClass0_0'::sum
  IL_0014: newobj instance void [System]System.Diagnostics.Stopwatch::.ctor()
  IL_0019: stloc.2
  IL_001a: ldloc.2
  IL_001b: callvirt instance void [System]System.Diagnostics.Stopwatch::Start()
  IL_0020: nop
  IL_0021: ldc.i4.0
  IL_0022: stloc.3
  IL_0023: br.s IL_003d
  IL_0025: nop
  IL_0026: ldloc.0
  IL_0027: ldfld int32 Test.Program/'c__DisplayClass0_0'::sum
  IL_002c: stloc.s V_4
  IL_002e: ldloc.0
  IL_002f: ldloc.s V_4
  IL_0031: ldc.i4.1
  IL_0032: add
  IL_0033: stfld int32 Test.Program/'c__DisplayClass0_0'::sum
  IL_0038: nop
  IL_0039: ldloc.3
  IL_003a: ldc.i4.1
  IL_003b: add
  IL_003c: stloc.3
  IL_003d: ldloc.3
  IL_003e: ldloc.1
  IL_003f: clt
  IL_0041: stloc.s V_5
  IL_0043: ldloc.s V_5
  IL_0045: brtrue.s IL_0025
  IL_0047: ldloc.2
  IL_0048: callvirt instance void [System]System.Diagnostics.Stopwatch::Stop()
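Notice anything? The ldfld and stfld instructions? These are object model instructions rather than base instructions. What they do is load from and store to a field of an object instead of working directly with a local variable on the stack. Those instructions are more expensive. Why is it compiled differently?
For one thing, parallelizing means that all threads need access to sum. Because the lambda passed to Parallel.For captures sum, the compiler turns sum into a field of a compiler-generated class (c__DisplayClass0_0) rather than leaving it as a local variable. That is a big difference: the single-threaded loop now has to go through that compiler-generated field too, instead of working directly on the stack. In addition, you'll notice that the compiler now also creates an instance of that class: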
IL_0000: newobj instance void Test.Program/'c__DisplayClass0_0'::.ctor()
It exists only to give access to the sum field, so it adds a bit more overhead.
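To make the transformation concrete, here is a hand-written sketch of roughly what the compiler does when a lambda captures sum. The names SumClosure, ClosureDemo, CountSequential and CountParallelRacy are hypothetical stand-ins chosen for illustration; the real generated class is the c__DisplayClass0_0 seen in the IL, and its exact shape may differ between compiler versions.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-in for the compiler-generated 'c__DisplayClass0_0'.
public sealed class SumClosure
{
    // The captured local is hoisted to a heap-allocated field.
    public int sum;

    // The lambda body becomes an instance method on the closure object.
    public void Body(int i)
    {
        sum++;
    }
}

public static class ClosureDemo
{
    // Mirrors the single-threaded loop of Code #2: every access to sum
    // now goes through the closure object (ldfld / stfld in the IL).
    public static int CountSequential(int iterations)
    {
        var locals = new SumClosure();   // newobj ...::.ctor()
        locals.sum = 0;                  // stfld
        for (int i = 0; i < iterations; i++)
        {
            locals.sum++;                // ldfld, add, stfld on each iteration
        }
        return locals.sum;
    }

    // Mirrors the Parallel.For call: the delegate targets the closure object,
    // which is why sum has to live there. The increment is still racy,
    // so the returned value is not guaranteed to equal iterations.
    public static int CountParallelRacy(int iterations)
    {
        var locals = new SumClosure();
        Parallel.For(0, iterations, new Action<int>(locals.Body));
        return locals.sum;
    }
}
```

Calling ClosureDemo.CountSequential(1000000000) would exercise the same field-based loop that the second IL listing shows.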
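I believe that if you change the second example to use a separate variable for the parallel sum (e.g. sum2), the result will be closer to what you expect:
(Same as the second example, except that the second sum uses a different variable):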
var iterations = 1000000000;
var sum = 0;
var stp = new Stopwatch();
stp.Start();
for (int i = 0; i < iterations; i++)
{
    sum++;
}
stp.Stop();
Console.WriteLine("Single Thread");
Console.WriteLine($"Sum: {sum}");
Console.WriteLine($"Time Taken (ms): {stp.ElapsedMilliseconds}");
var sum2 = 0;
stp.Reset();
stp.Start();
Parallel.For(0, iterations, x =>
{
    sum2++;
});
stp.Stop();
Console.WriteLine("Parallel");
Console.WriteLine($"Sum: {sum2}");
Console.WriteLine($"Time Taken (ms): {stp.ElapsedMilliseconds}");
Console.ReadKey(true);
.method private hidebysig static void Main(string[] args) cil managed
{
  .entrypoint
  // Code size 244 (0xf4)
  .maxstack 4
  .locals init ([0] class Test.Program/'c__DisplayClass0_0' 'CS$8__locals0',
           [1] int32 iterations,
           [2] int32 sum,
           [3] class [System]System.Diagnostics.Stopwatch stp,
           [4] int32 i,
           [5] bool V_5)
  IL_0000: newobj instance void Test.Program/'c__DisplayClass0_0'::.ctor()
  IL_0005: stloc.0
  IL_0006: nop
  IL_0007: ldc.i4 0x3b9aca00
  IL_000c: stloc.1
  IL_000d: ldc.i4.0
  IL_000e: stloc.2
  IL_000f: newobj instance void [System]System.Diagnostics.Stopwatch::.ctor()
  IL_0014: stloc.3
  IL_0015: ldloc.3
  IL_0016: callvirt instance void [System]System.Diagnostics.Stopwatch::Start()
  IL_001b: nop
  IL_001c: ldc.i4.0
  IL_001d: stloc.s i
  IL_001f: br.s IL_002d
  IL_0021: nop
  IL_0022: ldloc.2
  IL_0023: ldc.i4.1
  IL_0024: add
  IL_0025: stloc.2
  IL_0026: nop
  IL_0027: ldloc.s i
  IL_0029: ldc.i4.1
  IL_002a: add
  IL_002b: stloc.s i
  IL_002d: ldloc.s i
  IL_002f: ldloc.1
  IL_0030: clt
  IL_0032: stloc.s V_5
  IL_0034: ldloc.s V_5
  IL_0036: brtrue.s IL_0021
  IL_0038: ldloc.3
  IL_0039: callvirt instance void [System]System.Diagnostics.Stopwatch::Stop()
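Apart from a few different local slots and offsets, the loop is almost identical to the original single-threaded test.
Answer 1 (score: 1)
You need to 'warm up' your loops first. Try executing the first loop 10 times in a row and you will see the time drop after the first iteration.
What you are seeing is probably extra time spent by the JIT on the second half.
That said, even with a warm-up loop there is still a difference between the two, which is probably, as @Joel says, due to the extra work added to allow cross-thread access. You can check this by changing the parallel loop to use its own variable 'sum2', after which the timings appear to be equal.
The answer may be as simple as this: the compiler chooses to optimize the first loop to use a register when it sees no parallel access.
By the way, have a look at BenchmarkDotNet on NuGet. It handles warm-up and runs multiple tests to get accurate timings.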
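A minimal sketch of the warm-up idea follows, assuming you simply want to time the same loop several times in a row and compare the first run against the later ones. The names WarmupDemo, CountTo and TimeRuns are hypothetical helpers, not part of the original code.

```csharp
using System;
using System.Diagnostics;

public static class WarmupDemo
{
    // The loop under test, pulled into its own method so it can be called repeatedly.
    public static int CountTo(int iterations)
    {
        var sum = 0;
        for (int i = 0; i < iterations; i++)
        {
            sum++;
        }
        return sum;
    }

    // Times the same loop 'runs' times in a row and returns each elapsed time in ms.
    // The first entries typically include JIT and other one-time costs.
    public static long[] TimeRuns(int runs, int iterations)
    {
        var timings = new long[runs];
        var stp = new Stopwatch();
        for (int run = 0; run < runs; run++)
        {
            stp.Restart();               // reset to zero and start timing
            CountTo(iterations);
            stp.Stop();
            timings[run] = stp.ElapsedMilliseconds;
        }
        return timings;
    }
}
```

For example, printing each element of WarmupDemo.TimeRuns(10, 1000000000) should let you see whether the first run is an outlier. The actual numbers depend on the machine, the runtime, and whether the build is Debug or Release.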
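Answer 2 (score: -1)
The operations are not fully parallel. They need to share a resource on the host thread (the sum variable). So the parallel version has to do more work than the single-threaded version, because safe access to a shared resource requires extra coordination.
Also, you are spawning a bunch of threads just to increment an integer value. The work involved in spawning and queuing threads is greater than the work of merely incrementing a variable.
A better example would be if you had a collection such as an array or a list and wanted to do a substantial amount of work on each item in the collection. The collection can then be divided among the available threads, and each thread does more work than it cost to create.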
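As an illustration of what that coordination can look like, here is a sketch of two common ways to make the shared counter safe. Note that the sum++ in the question's Parallel.For is not atomic, which is consistent with the parallel sum of 275202313 instead of 1000000000. The class and method names (SafeSumDemo, SumWithInterlocked, SumWithThreadLocals) are hypothetical; Interlocked.Increment, Interlocked.Add and the thread-local overload of Parallel.For are standard .NET APIs.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class SafeSumDemo
{
    // Option 1: make every increment atomic.
    // Correct, but each iteration pays for cross-core synchronization.
    public static int SumWithInterlocked(int iterations)
    {
        var sum = 0;
        Parallel.For(0, iterations, i =>
        {
            Interlocked.Increment(ref sum);
        });
        return sum;
    }

    // Option 2: give each worker its own private subtotal and
    // combine the subtotals once per worker at the end.
    public static int SumWithThreadLocals(int iterations)
    {
        var sum = 0;
        Parallel.For<int>(0, iterations,
            () => 0,                                // initial per-worker subtotal
            (i, loopState, local) => local + 1,     // no shared state touched here
            local =>
            {
                Interlocked.Add(ref sum, local);    // one atomic add per worker
            });
        return sum;
    }
}
```

Both methods should return the same value as the single-threaded loop. The second usually needs far less coordination because the shared variable is touched only once per worker rather than once per iteration.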
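The kind of workload described above might look like the following sketch, assuming an int array as the collection and a made-up, deliberately slow function as the per-item work. CoarseGrainedDemo, ExpensiveWork and ProcessAll are hypothetical names used only for this illustration.

```csharp
using System;
using System.Threading.Tasks;

public static class CoarseGrainedDemo
{
    // Hypothetical stand-in for "a substantial amount of work" per item.
    public static long ExpensiveWork(int seed)
    {
        long acc = seed;
        for (int k = 0; k < 100000; k++)
        {
            acc = (acc * 31 + k) % 1000003;
        }
        return acc;
    }

    // Each index is processed independently and writes only to its own slot,
    // so the iterations share no mutable state and need no locking.
    public static long[] ProcessAll(int[] items)
    {
        var results = new long[items.Length];
        Parallel.For(0, items.Length, idx =>
        {
            results[idx] = ExpensiveWork(items[idx]);
        });
        return results;
    }
}
```

Here the work per item is large compared with the scheduling overhead, and no two iterations write to the same location, which is the situation where Parallel.For tends to pay off.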