Efficient handling of tuples as fixed-size vectors

Time: 2017-11-10 21:27:28

Tags: performance parallel-processing tuples chapel parallelism-amdahl

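In Chapel, homogeneous tuples can be used as if they were small "vectors" (e.g., a = b + c * 3.0 + 5.0;).

However, since various mathematical functions are not provided for tuples, I tried writing a norm() function in several ways and compared their performance. My code looks like this: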

proc norm_3tuple( x: 3*real ): real
{
    return sqrt( x[1]**2 + x[2]**2 + x[3]**2 );
}

proc norm_loop( x ): real
{
    var tmp = 0.0;
    for i in 1 .. x.size do
        tmp += x[i]**2;
    return sqrt( tmp );
}

proc norm_loop_param( x ): real
{
    var tmp = 0.0;
    for param i in 1 .. x.size do
        tmp += x[i]**2;
    return sqrt( tmp );
}

proc norm_reduce( x ): real
{
    var tmp = ( + reduce x**2 );
    return sqrt( tmp );
}

//.........................................................

var a = ( 1.0, 2.0, 3.0 );

// consistency check
writeln( norm_3tuple(     a ) );
writeln( norm_loop(       a ) );
writeln( norm_loop_param( a ) );
writeln( norm_reduce(     a ) );

config const nloops = 100000000;  // 1E+8

var res = 0.0;
for k in 1 .. nloops
{
    a[ 1 ] = (k % 5): real;

    res += norm_3tuple(     a );
 // res += norm_loop(       a );
 // res += norm_loop_param( a );
 // res += norm_reduce(     a );
}

writeln( "result = ", res );

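I compiled the above code with chpl --fast test.chpl (Chapel v1.16 on OSX 10.11 with 4 cores, installed via Homebrew). norm_3tuple(), norm_loop(), and norm_loop_param() all ran at almost the same speed (0.45 sec), while norm_reduce() was much slower (about 30 sec). I checked the output of the top command: norm_reduce() was using all 4 cores, while the other functions used only 1 core. So my questions are...

  • Is norm_reduce() slow because reduce works in parallel, and the overhead of parallel execution is much greater than the net computation cost for this small tuple?
  • Given that we want to avoid reduce for 3-tuples, the other three routines run at essentially the same speed. Does this mean that the explicit for-loop has negligible cost for 3-tuples (e.g., via loop unrolling enabled by the --fast option)?
  • In norm_loop_param(), I also tried the param keyword for the loop variable, but it gave me little or no performance gain. If we are only interested in homogeneous tuples, is it unnecessary to attach param at all (for performance)?

I'm sorry for asking so many questions at once. I would appreciate any advice or suggestions for handling small tuples efficiently. Thanks very much!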

2 answers:

Answer 0: (score: 2)

  

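Is norm_reduce() slow because reduce works in parallel, and the overhead of parallel execution is much greater than the net computation cost for this small tuple?

I believe you are correct. Reductions are executed in parallel, and Chapel currently makes no attempt at intelligent throttling to squash that parallelism when the work may not warrant it (as in this case). So I think you are paying a lot of task overhead for tasks that do almost no work other than coordinate with each other (though I am surprised the difference is this large... but I also find that I have little intuition for such things). In the future, we would like the compiler to serialize such small reductions in order to avoid these overheads.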
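A minimal sketch of how that task overhead could be sidestepped for small tuples in the meantime. The helper names norm_serial_loop and norm_serial_reduce are hypothetical (they are not from the question), and the second variant rests on the assumption that a serial statement suppresses the task creation inside reduce. Neither variant has been benchmarked here:

```chapel
// Hypothetical helpers, not part of the original post.

// Variant 1: a plain serial loop over the tuple's elements.
// Iterating the elements directly creates no tasks and needs no indexing.
proc norm_serial_loop( x ): real
{
    var tmp = 0.0;
    for xi in x do
        tmp += xi**2;
    return sqrt( tmp );
}

// Variant 2: the same reduce expression as norm_reduce(), wrapped in a
// serial statement. Assumption: this keeps the reduction on the calling task.
proc norm_serial_reduce( x ): real
{
    var tmp = 0.0;
    serial true do tmp = + reduce x**2;
    return sqrt( tmp );
}

writeln( norm_serial_loop(   ( 1.0, 2.0, 3.0 ) ) );  // 3.74166, as in the consistency check
writeln( norm_serial_reduce( ( 1.0, 2.0, 3.0 ) ) );
```

Whether variant 2 actually recovers the speed of the loop versions would have to be measured; variant 1 is essentially norm_loop() without the index range.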

  

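Given that we want to avoid reduce for 3-tuples, the other three routines run at essentially the same speed. Does this mean that the explicit for-loop has negligible cost for 3-tuples (e.g., via loop unrolling enabled by the --fast option)?

The Chapel compiler does not unroll the explicit for loop in norm_loop() (you can verify this by inspecting the code generated with the --savec flag), but the back-end compiler might. Alternatively, the for loop may simply not cost much compared to the unrolled loop of norm_loop_param(). I suspect you would need to inspect the generated assembly to determine which is the case. But I would also expect back-end C compilers to do a decent job with the code we generate; for example, it is easy for one to see that this is a 3-iteration loop.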

  

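In norm_loop_param(), I also tried the param keyword for the loop variable, but it gave me little or no performance gain. If we are only interested in homogeneous tuples, is it unnecessary to attach param at all (for performance)?

It is hard to give a definitive answer here, since I think this is mostly a question of how good the back-end C compiler is.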

Answer 1: (score: 1)

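Ex-post remark: there was actually a third performance surprise at the very end ...

Performance?
Benchmark! ... always, no exceptions, no excuses

This is what makes Chapel so great. Many thanks to the Chapel team for developing and improving such a great computing tool for HPC over the past decade.

With all due love for true-[PARALLEL] efforts, performance is always a result of both design practices and the underlying system hardware, never just a "bonus" granted by a syntax constructor.

In norm_reduce(), the system spends several milliseconds just setting up all the concurrency-capable reduce machinery, only to compute a single x**2 product per task and return it to a result queue for a deferred summation by the central + reducer. Quite an overhead for a single 2-CLK CPU uop, isn't it?

For the reasons why, one may review the costs of process-scheduling details and my updated criticism of Amdahl's Law original formulation.

Benchmarking the code actually delivered two surprises: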

+++++++++++++++++++++++++++++++++++++++++++++++ <TiO.IDE>.RUN
                                        3.74166
[SEQ]       norm_loop():    0.0 [us] -- 3.74166
[SEQ] norm_loop_param():    0.0 [us] -- 3.74166
[PAR]:    norm_reduce(): 5677.0 [us] -- 3.74166

                                        3.74166
[SEQ]       norm_loop():    0.0 [us] -- 3.74166
[SEQ] norm_loop_param():    1.0 [us] -- 3.74166
[PAR]:    norm_reduce(): 5818.0 [us] -- 3.74166

                                        3.74166
[SEQ]       norm_loop():    1.0 [us] -- 3.74166
[SEQ] norm_loop_param():    2.0 [us] -- 3.74166
[PAR]:    norm_reduce(): 4886.0 [us] -- 3.74166

The first was the one reported in the original post; the second was observed after the Chapel run was equipped with the --fast compiler switch:

+++++++++++++++++++++++++++++++++++++++++++++++ <TiO.IDE>.+CompilerFLAG( "--fast" ).RUN
                                        3.74166
[SEQ]       norm_loop():    1.0 [us] -- 3.74166
[SEQ] norm_loop_param():    2.0 [us] -- 3.74166
[PAR]:    norm_reduce(): 7769.0 [us] -- 3.74166

                                        3.74166
[SEQ]       norm_loop():    0.0 [us] -- 3.74166
[SEQ] norm_loop_param():    0.0 [us] -- 3.74166
[PAR]:    norm_reduce(): 9109.0 [us] -- 3.74166

                                        3.74166
[SEQ]       norm_loop():    1.0 [us] -- 3.74166
[SEQ] norm_loop_param():    1.0 [us] -- 3.74166
[PAR]:    norm_reduce(): 8807.0 [us] -- 3.74166

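As always, SuperComputing2017 HPC promotes [reproducibility] for every aspect published in a technical paper or benchmark.

These results were collected on the online platform sponsored by Try-it-Online. All interested enthusiasts are welcome to re-run the code and post the Chapel-code performance details from their localhost or cluster, so as to better document the hardware-related variability of the timings observed above (for further experimentation with ready-to-run, timing-decorated code, one may use this link to a stateful snapshot of the TiO.IDE).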

/* ---------------------------------------SETUP-SECTION-UNDER-TEST--*/ use Time;
/* ---------------------------------------SETUP-SECTION-UNDER-TEST--*/ var aStopWATCH_SEQ: Timer;
/* ---------------------------------------SETUP-SECTION-UNDER-TEST--*/ var aStopWATCH_PAR: Timer;

proc norm_3tuple( x: 3*real ): real
{
    return sqrt( x[1]**2 + x[2]**2 + x[3]**2 );
}

proc norm_loop( x ): real
{
/* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_SEQ.start();
    var tmp = 0.0;
    for i in 1 .. x.size do
        tmp += x[i]**2;
/* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_SEQ.stop(); write(                          "[SEQ]       norm_loop(): ",
                                                                       aStopWATCH_SEQ.elapsed( Time.TimeUnits.microseconds ), " [us] -- " );
    return sqrt( tmp );
}

proc norm_loop_param( x ): real
{
/* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_SEQ.start();
    var tmp = 0.0;
    for param i in 1 .. x.size do
        tmp += x[i]**2;
/* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_SEQ.stop(); write(                          "[SEQ] norm_loop_param(): ",
                                                                       aStopWATCH_SEQ.elapsed( Time.TimeUnits.microseconds ), " [us] -- " );
    return sqrt( tmp );
}

proc norm_reduce( x ): real
{
/* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_PAR.start();
    var tmp = ( + reduce x**2 );
/* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_PAR.stop(); write(                          "[PAR]:    norm_reduce(): ",
                                                                       aStopWATCH_PAR.elapsed( Time.TimeUnits.microseconds ), " [us] -- " );
    return sqrt( tmp );
}

//.........................................................

var a = ( 1.0, 2.0, 3.0 );

// consistency check
writeln( norm_3tuple(     a ) );
writeln( norm_loop(       a ) );
writeln( norm_loop_param( a ) );
writeln( norm_reduce(     a ) );

Scaling:

 [LOOP] norm_3tuple():       45829.0 [us] -- result = 4.30918e+06 @   1000000 loops.
 [LOOP] norm_3tuple():      241680   [us] -- result = 4.30918e+07 @  10000000 loops.
 [LOOP] norm_3tuple():     2387080   [us] -- result = 4.30918e+08 @ 100000000 loops.
[LOOP]  norm_loop():         72160.0 [us] -- result = 4.30918e+06 @   1000000 loops.
[LOOP]  norm_loop():        755959   [us] -- result = 4.30918e+07 @  10000000 loops.
[LOOP]  norm_loop():       7783740   [us] -- result = 4.30918e+08 @ 100000000 loops.
[LOOP]  norm_loop_param():   34102.0 [us] -- result = 4.30918e+06 @   1000000 loops.
[LOOP]  norm_loop_param():  365510   [us] -- result = 4.30918e+07 @  10000000 loops.
[LOOP]  norm_loop_param(): 3480310   [us] -- result = 4.30918e+08 @ 100000000 loops.
-------------------------------------------------------------------------1000--------{--fast}---------------------------------------------------------------------
[LOOP]  norm_reduce():     5851380   [us] -- result = 4309.18     @      1000 loops.
[LOOP]  norm_reduce():     5884600   [us] -- result = 4309.18     @      1000 loops.
[LOOP]  norm_reduce():     6163690   [us] -- result = 4309.18     @      1000 loops.
[LOOP]  norm_reduce():     6029860   [us] -- result = 4309.18     @      1000 loops.
[LOOP]  norm_reduce():     6083730   [us] -- result = 4309.18     @      1000 loops.
[LOOP]  norm_reduce():     6132720   [us] -- result = 4309.18     @      1000 loops.
[LOOP]  norm_reduce():     6012620   [us] -- result = 4309.18     @      1000 loops.
[LOOP]  norm_reduce():     6379020   [us] -- result = 4309.18     @      1000 loops.
[LOOP]  norm_reduce():     5923550   [us] -- result = 4309.18     @      1000 loops.
[LOOP]  norm_reduce():     6144660   [us] -- result = 4309.18     @      1000 loops.
[LOOP]  norm_reduce():     8098380   [us] -- result = 4309.18     @      1000 loops. [--fast]
[LOOP]  norm_reduce():     6215470   [us] -- result = 4309.18     @      1000 loops. [--fast]
[LOOP]  norm_reduce():     5831670   [us] -- result = 4309.18     @      1000 loops. [--fast]
[LOOP]  norm_reduce():     6124580   [us] -- result = 4309.18     @      1000 loops. [--fast]
[LOOP]  norm_reduce():     6092740   [us] -- result = 4309.18     @      1000 loops. [--fast]
[LOOP]  norm_reduce():     5811260   [us] -- result = 4309.18     @      1000 loops. [--fast]
[LOOP]  norm_reduce():     5880400   [us] -- result = 4309.18     @      1000 loops. [--fast]
[LOOP]  norm_reduce():     5898520   [us] -- result = 4309.18     @      1000 loops. [--fast]
[LOOP]  norm_reduce():     6591110   [us] -- result = 4309.18     @      1000 loops. [--fast]
[LOOP]  norm_reduce():     5876570   [us] -- result = 4309.18     @      1000 loops. [--fast]
[LOOP]  norm_reduce():     6034180   [us] -- result = 4309.18     @      1000 loops. [--fast]


-------------------------------------------------------------------------2000--------{--fast}---------------------------------------------------------------------
[LOOP]  norm_reduce():    12434700   [us] -- result = 8618.36     @      2000 loops.


-------------------------------------------------------------------------3000--------{--fast}---------------------------------------------------------------------
[LOOP]  norm_reduce():    17807600   [us] -- result = 12927.5     @      3000 loops.


-------------------------------------------------------------------------4000--------{--fast}---------------------------------------------------------------------
[LOOP]  norm_reduce():    23844300   [us] -- result = 17236.7     @      4000 loops.


-------------------------------------------------------------------------5000--------{--fast}---------------------------------------------------------------------
[LOOP]  norm_reduce():    30557700   [us] -- result = 21545.9     @      5000 loops.
[LOOP]  norm_reduce():    30523700   [us] -- result = 21545.9     @      5000 loops.
[LOOP]  norm_reduce():    29404200   [us] -- result = 21545.9     @      5000 loops.
[LOOP]  norm_reduce():    29268600   [us] -- result = 21545.9     @      5000 loops. [--fast]
[LOOP]  norm_reduce():    29009500   [us] -- result = 21545.9     @      5000 loops. [--fast]
[LOOP]  norm_reduce():    30388800   [us] -- result = 21545.9     @      5000 loops. [--fast]


-------------------------------------------------------------------------6000--------{--fast}---------------------------------------------------------------------
[LOOP]  norm_reduce():    37070600   [us] -- result = 25855.1     @      6000 loops.


-------------------------------------------------------------------------7000--------{--fast}---------------------------------------------------------------------
[LOOP]  norm_reduce():    42789200   [us] -- result = 30164.3     @      7000 loops.


---------------------------------------------------------------------8000--------{--fast}---------------------------------------------------------------------
[LOOP]  norm_reduce():    50572700   [us] -- result = 34473.4     @      8000 loops.
[LOOP]  norm_reduce():    49944300   [us] -- result = 34473.4     @      8000 loops.
[LOOP]  norm_reduce():    49365600   [us] -- result = 34473.4     @      8000 loops.
[LOOP]  norm_reduce():   ~60+                                                                 // exceeded the 60 seconds limit and was terminated [Exit code: 124]
[LOOP]  norm_reduce():    50099900   [us] -- result = 34473.4     @      8000 loops.
[LOOP]  norm_reduce():    49445500   [us] -- result = 34473.4     @      8000 loops.
[LOOP]  norm_reduce():    49783800   [us] -- result = 34473.4     @      8000 loops.
[LOOP]  norm_reduce():    48533400   [us] -- result = 34473.4     @      8000 loops.
[LOOP]  norm_reduce():    48966600   [us] -- result = 34473.4     @      8000 loops.
[LOOP]  norm_reduce():    47564700   [us] -- result = 34473.4     @      8000 loops.
[LOOP]  norm_reduce():    47087400   [us] -- result = 34473.4     @      8000 loops.
[LOOP]  norm_reduce():    47624300   [us] -- result = 34473.4     @      8000 loops. [--fast]
[LOOP]  norm_reduce():   ~60+                                                        [--fast] // exceeded the 60 seconds limit and was terminated [Exit code: 124]
[LOOP]  norm_reduce():   ~60+                                                        [--fast] // exceeded the 60 seconds limit and was terminated [Exit code: 124]
[LOOP]  norm_reduce():    46887700   [us] -- result = 34473.4     @      8000 loops. [--fast]
[LOOP]  norm_reduce():    46571800   [us] -- result = 34473.4     @      8000 loops. [--fast]
[LOOP]  norm_reduce():    46794700   [us] -- result = 34473.4     @      8000 loops. [--fast]
[LOOP]  norm_reduce():    46862600   [us] -- result = 34473.4     @      8000 loops. [--fast]
[LOOP]  norm_reduce():    47348700   [us] -- result = 34473.4     @      8000 loops. [--fast]
[LOOP]  norm_reduce():    46669500   [us] -- result = 34473.4     @      8000 loops. [--fast]
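
As a rough reading of the scaling table above, the per-call costs can be estimated by dividing the elapsed time by the loop count. This is only a back-of-the-envelope sketch: it uses the first listed run of each kind, the loop timings also include the loop body's own overhead, and all figures come from a shared online platform:

```latex
\begin{align*}
t_{\text{norm\_3tuple}} &\approx \frac{2\,387\,080\ \mu\text{s}}{10^{8}\ \text{calls}} \approx 0.024\ \mu\text{s per call} \\
t_{\text{norm\_reduce}} &\approx \frac{5\,851\,380\ \mu\text{s}}{10^{3}\ \text{calls}} \approx 5\,851\ \mu\text{s per call} \\
\frac{t_{\text{norm\_reduce}}}{t_{\text{norm\_3tuple}}} &\approx \frac{5\,851}{0.024} \approx 2.4 \times 10^{5}
\end{align*}
```

The norm_reduce() cost stays at roughly 6 ms per call at every loop count from 1000 to 8000, so the total grows linearly with the number of calls, with or without --fast.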

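A third surprise came from going into a forall do { ... }:

While the [SEQ] nloops-ed code was heavily devastated by the associated add-on overheads, a small reformulation of the problem showed that very different performance levels can be achieved even on a single-CPU platform (performance should rise all the more for multi-CPU code execution), as did the effect that the --fast compiler switch produced here: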

/* ---------------------------------------SETUP-SECTION-UNDER-TEST--*/ use Time;
/* ---------------------------------------SETUP-SECTION-UNDER-TEST--*/ var aStopWATCH_LOOP: Timer;

config const nloops = 100000000;  // 1E+8    
       var   res: atomic real;
             res.write( 0.0 );
//------------------------------------------------------------------// PRE-COMPUTE:
var A1:    [1 .. nloops] real;                                      // pre-compute a tuple-element value
forall k in 1 .. nloops do                                          // pre-compute a tuple-element value
    A1[k] = (k % 5): real;                                          // pre-compute a tuple-element value to a ( k % 5 ), ex-post typecast to real

/* ---------------------------------------------SECTION-UNDER-TEST--*/  aStopWATCH_LOOP.start();
forall i in 1 .. nloops do
{               //  a[1] = (  i % 5 ): real;                        // pre-compute'd
   res.add( norm_reduce( ( A1[i],            a[1], a[2] ) ) );      //     atomic.add()
// res +=   norm_reduce( ( (  i % 5 ): real, a[1], a[2] ) );        // non-atomic
//:49: note: The shadow variable 'res' is constant due to forall intents in this loop

}/* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_LOOP.stop(); write(
  "forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: ",     aStopWATCH_LOOP.elapsed( Time.TimeUnits.microseconds ), " [us] -- " );
/* 
   --------------------------------------------------------------------------------------------------------{-nloops-}-------{--fast}-------------
   forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }:     7911.0 [us] -- result =     320.196 @       100 loops. 
   forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }:     8055.0 [us] -- result =    3201.96  @      1000 loops.
   forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }:     8002.0 [us] -- result =   32019.6   @     10000 loops.
   forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }:    80685.0 [us] -- result = 3.20196e+05 @    100000 loops.
   forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }:   842948   [us] -- result = 3.20196e+06 @   1000000 loops.
   forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }:  8005300   [us] -- result = 3.20196e+07 @  10000000 loops.
   forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 40358900   [us] -- result = 1.60098e+08 @  50000000 loops.
   forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 40671200   [us] -- result = 1.60098e+08 @  50000000 loops.

   forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }:  2195000   [us] -- result = 1.60098e+08 @  50000000 loops. [--fast]

   forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }:  4518790   [us] -- result = 3.20196e+08 @ 100000000 loops. [--fast]
   forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }:  6178440   [us] -- result = 3.20196e+08 @ 100000000 loops. [--fast]
   forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }:  4755940   [us] -- result = 3.20196e+08 @ 100000000 loops. [--fast]
   forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }:  4405480   [us] -- result = 3.20196e+08 @ 100000000 loops. [--fast]
   forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }:  4509170   [us] -- result = 3.20196e+08 @ 100000000 loops. [--fast]
   forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }:  4736110   [us] -- result = 3.20196e+08 @ 100000000 loops. [--fast]
   forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }:  4653610   [us] -- result = 3.20196e+08 @ 100000000 loops. [--fast]
   forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }:  4397990   [us] -- result = 3.20196e+08 @ 100000000 loops. [--fast]
   forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }:  4655240   [us] -- result = 3.20196e+08 @ 100000000 loops. [--fast]
  */
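
A possible variation on the forall experiment above, offered only as an untested sketch: instead of an atomic accumulator, the outer loop can carry a + reduce intent, so that each task sums into its own copy of res and the copies are combined at the end, while the per-tuple norm stays serial. The names n and norm_seq are hypothetical and not from the original code, and the tuple used is the question's ( k % 5, 2.0, 3.0 ) rather than the pre-computed one above:

```chapel
config const n = 1000;                // hypothetical loop count (the question used 1E+8)

// Serial norm of a small homogeneous tuple: no tasks are created per call.
proc norm_seq( x ): real
{
    var tmp = 0.0;
    for xi in x do
        tmp += xi**2;
    return sqrt( tmp );
}

var res = 0.0;

// Parallelism lives in the outer loop only; the reduce intent gives each
// task a private accumulator and sums the accumulators when the loop ends.
forall k in 1 .. n with ( + reduce res ) do
    res += norm_seq( ( (k % 5): real, 2.0, 3.0 ) );

writeln( "result = ", res );          // for n = 1000 this should match the 4309.18 reported above
```

Because the per-task partial sums are combined in an unspecified order, the last digits of res may differ slightly from a purely serial summation.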