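Why does adding a Parallel.For loop slow down the single-threaded loop?

Date: 2018-04-05 19:16:53

Tags: c# .net parallel-processing

I was trying to compare single-threaded and parallel performance in C# when I stumbled on this peculiar case:

Code #1 (single-threaded only)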

static void Main(string[] args)
{
    var iterations = 1000000000;
    var sum = 0;

    var stp = new Stopwatch();
    stp.Start();
    for (int i = 0; i < iterations; i++)
    {
        sum++;
    }
    stp.Stop();

    Console.WriteLine("Single Thread");
    Console.WriteLine($"Sum: {sum}");
    Console.WriteLine($"Time Taken (ms): {stp.ElapsedMilliseconds}");
}

Result

Single Thread
Sum: 1000000000
Time Taken (ms): 351

Code #2 (single-threaded loop followed by a parallel loop)

static void Main(string[] args)
{
    var iterations = 1000000000;
    var sum = 0;

    var stp = new Stopwatch();
    stp.Start();
    for (int i = 0; i < iterations; i++)
    {
        sum++;
    }
    stp.Stop();

    Console.WriteLine("Single Thread");
    Console.WriteLine($"Sum: {sum}");
    Console.WriteLine($"Time Taken (ms): {stp.ElapsedMilliseconds}");

    sum = 0;
    stp.Reset();
    stp.Start();
    Parallel.For(0, iterations, i =>
    {
        sum++;
    });
    stp.Stop();
    Console.WriteLine("Parallel");
    Console.WriteLine($"Sum: {sum}");
    Console.WriteLine($"Time Taken (ms): {stp.ElapsedMilliseconds}");
}

Result

Single Thread
Sum: 1000000000
Time Taken (ms): 1865

Parallel
Sum: 275202313
Time Taken (ms): 5831

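Why does the performance of the single-threaded part change so drastically once the parallel part is added?

This is the difference I mean:

Code #1, Single Thread: Sum: 1000000000, Time Taken (ms): 351

Code #2, Single Thread: Sum: 1000000000, Time Taken (ms): 1865

351 vs 1865 ms for the very same piece of code?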

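3 Answers:

Answer 0: (score: 3)

The IL generated for each one is different. Let's look at the first example first (no parallel code in the program, shown only up to Stopwatch.Stop()):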

.method private hidebysig static void  Main(string[] args) cil managed
{
  .entrypoint
  // Code size       121 (0x79)
  .maxstack  2
  .locals init ([0] int32 iterations,
           [1] int32 sum,
           [2] class [System]System.Diagnostics.Stopwatch stp,
           [3] int32 i,
           [4] bool V_4)
  IL_0000:  nop
  IL_0001:  ldc.i4     0x3b9aca00  //Loads 1000000000 (iterations)
  IL_0006:  stloc.0   //Pop it into local variable 0 (iterations)
  IL_0007:  ldc.i4.0  //Push 0 onto the stack as int32
  IL_0008:  stloc.1   //Pop it into local variable 1 (sum)
  IL_0009:  newobj     instance void [System]System.Diagnostics.Stopwatch::.ctor()
  IL_000e:  stloc.2
  IL_000f:  ldloc.2
  IL_0010:  callvirt   instance void [System]System.Diagnostics.Stopwatch::Start()
  IL_0015:  nop
  IL_0016:  ldc.i4.0
  IL_0017:  stloc.3
  IL_0018:  br.s       IL_0024
  IL_001a:  nop        //Start of For Loop
  IL_001b:  ldloc.1
  IL_001c:  ldc.i4.1
  IL_001d:  add
  IL_001e:  stloc.1
  IL_001f:  nop
  IL_0020:  ldloc.3
  IL_0021:  ldc.i4.1
  IL_0022:  add
  IL_0023:  stloc.3
  IL_0024:  ldloc.3
  IL_0025:  ldloc.0
  IL_0026:  clt
  IL_0028:  stloc.s    V_4
  IL_002a:  ldloc.s    V_4
  IL_002c:  brtrue.s   IL_001a    //If true, branch back to start
  IL_002e:  ldloc.2
  IL_002f:  callvirt   instance void [System]System.Diagnostics.Stopwatch::Stop()

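I've added a few comments, but this is all fairly straightforward. Let's compare it with the parallel version (again, only the for loop, up to the point where the stopwatch is stopped):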

.method private hidebysig static void  Main(string[] args) cil managed
{
  .entrypoint
  // Code size       257 (0x101)
  .maxstack  4
  .locals init ([0] class Test.Program/'c__DisplayClass0_0' 'CS$8__locals0',
           [1] int32 iterations,
           [2] class [System]System.Diagnostics.Stopwatch stp,
           [3] int32 i,
           [4] int32 V_4,
           [5] bool V_5)
  IL_0000:  newobj     instance void Test.Program/'c__DisplayClass0_0'::.ctor()
  IL_0005:  stloc.0
  IL_0006:  nop
  IL_0007:  ldc.i4     0x3b9aca00
  IL_000c:  stloc.1
  IL_000d:  ldloc.0
  IL_000e:  ldc.i4.0
  IL_000f:  stfld      int32 Test.Program/'c__DisplayClass0_0'::sum
  IL_0014:  newobj     instance void [System]System.Diagnostics.Stopwatch::.ctor()
  IL_0019:  stloc.2
  IL_001a:  ldloc.2
  IL_001b:  callvirt   instance void [System]System.Diagnostics.Stopwatch::Start()
  IL_0020:  nop
  IL_0021:  ldc.i4.0
  IL_0022:  stloc.3
  IL_0023:  br.s       IL_003d
  IL_0025:  nop
  IL_0026:  ldloc.0
  IL_0027:  ldfld      int32 Test.Program/'c__DisplayClass0_0'::sum
  IL_002c:  stloc.s    V_4
  IL_002e:  ldloc.0
  IL_002f:  ldloc.s    V_4
  IL_0031:  ldc.i4.1
  IL_0032:  add
  IL_0033:  stfld      int32 Test.Program/'c__DisplayClass0_0'::sum
  IL_0038:  nop
  IL_0039:  ldloc.3
  IL_003a:  ldc.i4.1
  IL_003b:  add
  IL_003c:  stloc.3
  IL_003d:  ldloc.3
  IL_003e:  ldloc.1
  IL_003f:  clt
  IL_0041:  stloc.s    V_5
  IL_0043:  ldloc.s    V_5
  IL_0045:  brtrue.s   IL_0025
  IL_0047:  ldloc.2
  IL_0048:  callvirt   instance void [System]System.Diagnostics.Stopwatch::Stop()

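Notice something? The ldfld and stfld instructions? These are object model instructions rather than base instructions. They load from and store to a field on an object instead of working directly with local variables on the stack, and those accesses are more expensive. Why is it compiled differently?

For one thing, the lambda passed to Parallel.For has to reach sum from other threads, so the compiler turns sum into a field of a compiler-generated class (c__DisplayClass0_0) instead of leaving it as a local variable. That is a big difference: the single-threaded loop shares the same variable, so it must now go through that compiler-generated field rather than working directly on the stack. You will also notice that the compiler now creates an instance of that class: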

IL_0000: newobj instance void Test.Program/'c__DisplayClass0_0'::.ctor()

That instance exists only to hold the sum field, so it adds further overhead.
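To make that concrete, here is a rough hand-written equivalent of what the compiler does with the captured variable. This is only a sketch: SumClosure and ClosureDemo are names I made up for illustration, and the real generated class (c__DisplayClass0_0) will not look exactly like this.

```csharp
using System;
using System.Threading.Tasks;

// Hand-written stand-in for the compiler-generated 'c__DisplayClass0_0'.
// The name and shape are illustrative assumptions, not the exact generated code.
sealed class SumClosure
{
    public int sum;                      // the captured local now lives in a heap object

    public void Body(int i) { sum++; }   // the lambda body becomes an instance method
}

static class ClosureDemo
{
    // Returns the value of sum after the single-threaded loop.
    public static int RunLowered(int iterations)
    {
        var closure = new SumClosure();  // corresponds to the newobj at IL_0000
        closure.sum = 0;                 // a field store (stfld) instead of stloc

        // The plain for loop shares the variable with the lambda,
        // so it also reads and writes the field (ldfld/stfld).
        for (int i = 0; i < iterations; i++)
        {
            closure.sum++;
        }
        int afterLoop = closure.sum;

        // The parallel part uses the same object. It is still racy,
        // exactly like the code in the question.
        closure.sum = 0;
        Parallel.For(0, iterations, closure.Body);

        return afterLoop;
    }

    static void Main()
    {
        Console.WriteLine($"Sum after single-threaded loop: {RunLowered(1000000)}");
    }
}
```

The point is only that both loops touch closure.sum, so the first loop pays for field access even though it never runs in parallel.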

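I believe that if you change the second example to use a separate variable for the parallel sum (for example sum2), the result will be much closer to what you expect:

(Same as the second example, except that the second sum uses a different variable):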

var iterations = 1000000000;
var sum = 0;
var stp = new Stopwatch();
stp.Start();
for (int i = 0; i < iterations; i++)
{
    sum++;
}
stp.Stop();

Console.WriteLine("Single Thread");
Console.WriteLine($"Sum: {sum}");
Console.WriteLine($"Time Taken (ms): {stp.ElapsedMilliseconds}");

var sum2 = 0;
stp.Reset();
stp.Start();
Parallel.For(0, iterations, x =>
{
    sum2++;
});
stp.Stop();
Console.WriteLine("Parallel");
Console.WriteLine($"Sum: {sum2}");
Console.WriteLine($"Time Taken (ms): {stp.ElapsedMilliseconds}");

Console.ReadKey(true);

The IL for this version (again, up to Stopwatch.Stop()):

.method private hidebysig static void  Main(string[] args) cil managed
{
  .entrypoint
  // Code size       244 (0xf4)
  .maxstack  4
  .locals init ([0] class Test.Program/'c__DisplayClass0_0' 'CS$8__locals0',
           [1] int32 iterations,
           [2] int32 sum,
           [3] class [System]System.Diagnostics.Stopwatch stp,
           [4] int32 i,
           [5] bool V_5)
  IL_0000:  newobj     instance void Test.Program/'c__DisplayClass0_0'::.ctor()
  IL_0005:  stloc.0
  IL_0006:  nop
  IL_0007:  ldc.i4     0x3b9aca00
  IL_000c:  stloc.1
  IL_000d:  ldc.i4.0
  IL_000e:  stloc.2
  IL_000f:  newobj     instance void [System]System.Diagnostics.Stopwatch::.ctor()
  IL_0014:  stloc.3
  IL_0015:  ldloc.3
  IL_0016:  callvirt   instance void [System]System.Diagnostics.Stopwatch::Start()
  IL_001b:  nop
  IL_001c:  ldc.i4.0
  IL_001d:  stloc.s    i
  IL_001f:  br.s       IL_002d
  IL_0021:  nop
  IL_0022:  ldloc.2
  IL_0023:  ldc.i4.1
  IL_0024:  add
  IL_0025:  stloc.2
  IL_0026:  nop
  IL_0027:  ldloc.s    i
  IL_0029:  ldc.i4.1
  IL_002a:  add
  IL_002b:  stloc.s    i
  IL_002d:  ldloc.s    i
  IL_002f:  ldloc.1
  IL_0030:  clt
  IL_0032:  stloc.s    V_5
  IL_0034:  ldloc.s    V_5
  IL_0036:  brtrue.s   IL_0021
  IL_0038:  ldloc.3
  IL_0039:  callvirt   instance void [System]System.Diagnostics.Stopwatch::Stop()

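Apart from a few different local slots and offsets, it is almost identical to the original test.

Answer 1: (score: 1)

You need to 'warm up' your loop first. Try executing the first loop 10 times in a row and you will see the time drop after the first iteration.

What you are seeing is probably the extra time taken to JIT the second half.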
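A minimal sketch of that experiment, assuming the same loop as in the question. WarmUpDemo and TimeLoop are hypothetical names I picked, and the timings will depend on your machine and build configuration, so I am not quoting any numbers.

```csharp
using System;
using System.Diagnostics;

static class WarmUpDemo
{
    // Runs the counting loop once and returns the sum with the elapsed time.
    public static (int Sum, long Ms) TimeLoop(int iterations)
    {
        var sum = 0;
        var stp = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            sum++;
        }
        stp.Stop();
        return (sum, stp.ElapsedMilliseconds);
    }

    static void Main()
    {
        const int iterations = 1000000000;

        // Repeat the measurement; the first run includes any one-off JIT cost.
        for (int run = 1; run <= 10; run++)
        {
            var (sum, ms) = TimeLoop(iterations);
            Console.WriteLine($"Run {run}: Sum: {sum}, Time Taken (ms): {ms}");
        }
    }
}
```

Run it in Release mode without a debugger attached, otherwise the numbers say more about the debugger than about the loop.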

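That said, even with a warm-up loop there is still a difference between the two, which is probably, as @Joel says, down to the extra work added so the variable can be accessed across threads. You can check this by changing the parallel loop to use its own variable 'sum2'; the times then appear to be equal.

The answer may be as simple as the compiler choosing to optimize the first loop to use a register when it cannot see any parallel access.

By the way, take a look at BenchmarkDotNet on NuGet. It handles warm-up and runs each test multiple times to get accurate timings.

Answer 2: (score: -1)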

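The operations are not completely parallel. They have to share a resource (the sum variable) that belongs to the host thread. So the parallel version has to do more work than the single-threaded version, because accessing a shared resource safely requires extra coordination.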
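The code in the question does not actually do that coordination, which is why its parallel sum comes out wrong (275202313 instead of 1000000000). Here is a sketch of two common ways to make the shared sum correct, using Interlocked and the thread-local overload of Parallel.For. SharedSumDemo and its method names are made up for this example.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class SharedSumDemo
{
    // Every iteration performs an atomic increment on the shared variable.
    // Correct, but all threads contend on the same memory location.
    public static int SumWithInterlocked(int iterations)
    {
        var sum = 0;
        Parallel.For(0, iterations, i =>
        {
            Interlocked.Increment(ref sum);
        });
        return sum;
    }

    // Each worker keeps a private subtotal and touches the shared
    // variable only once, when it finishes.
    public static int SumWithThreadLocal(int iterations)
    {
        var sum = 0;
        Parallel.For<int>(0, iterations,
            () => 0,                                    // initial per-worker subtotal
            (i, state, local) => local + 1,             // no shared access in the body
            local => Interlocked.Add(ref sum, local));  // merge once per worker
        return sum;
    }

    static void Main()
    {
        Console.WriteLine($"Interlocked:  {SumWithInterlocked(1000000)}");
        Console.WriteLine($"Thread-local: {SumWithThreadLocal(1000000)}");
    }
}
```

Both versions give the right total. The thread-local one usually needs far less coordination because the shared variable is only touched when a worker finishes, but I would measure before assuming either beats the plain loop for work this small.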

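Also, you are spawning a bunch of threads just to increment an integer value. The work involved in spawning and queueing threads is greater than the work of simply incrementing a variable.

A better example would be a collection such as an array or a list where you want to do a lot of work on each item. The collection can then be divided among the available threads, and each thread does more work than it costs to create it.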
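Something along these lines, as a sketch. ExpensiveWork is a hypothetical placeholder for whatever per-item processing you really have, and PerItemWorkDemo is just a name for the example.

```csharp
using System;
using System.Threading.Tasks;

static class PerItemWorkDemo
{
    // Hypothetical stand-in for "a lot of work" on a single item.
    public static double ExpensiveWork(int item)
    {
        double acc = 0;
        for (int k = 1; k <= 10000; k++)
        {
            acc += Math.Sqrt(item + k);
        }
        return acc;
    }

    // Processes every item in parallel. Each iteration writes only to its
    // own slot of the result array, so no shared counter is needed.
    public static double[] ProcessAll(int[] items)
    {
        var results = new double[items.Length];
        Parallel.For(0, items.Length, i =>
        {
            results[i] = ExpensiveWork(items[i]);
        });
        return results;
    }

    static void Main()
    {
        var items = new int[100000];
        for (int i = 0; i < items.Length; i++)
        {
            items[i] = i;
        }

        var results = ProcessAll(items);
        Console.WriteLine($"Processed {results.Length} items");
    }
}
```

Here each iteration does thousands of operations, so the scheduling cost per item is small compared with the work itself, and nothing is shared between iterations.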