Question

我正在尝试提高矩阵乘法速度。还有其他我可以做到的实现可以加快它的实现到目前为止，这是我的结果，我尝试执行8192，但花了2个小时以上，并且ssh连接超时。

这是我的实现：

use Random, Time;
var t : Timer;
t.start();

config const size = 10;
var grid : [1..size, 1..size] real;
var grid2 : [1..size, 1..size] real;
var grid3 : [1..size, 1..size] real;

fillRandom(grid);
fillRandom(grid2);

//t.start();
forall i in 1..size {
    forall j in 1..size {
        forall k in 1..size {
            grid3[i,j] += grid[i,k] * grid2[k,j];
        }
    }
}
t.stop();
writeln("Done!:");
writeln(t.elapsed(),"seconds");
writeln("Size of matrix was:", size);
t.clear();

我正在将时间与c ++中的MPI实现进行比较。我想知道是否可以将矩阵分配到类似于MPI的两个语言环境？

Answer 1

这种forall循环嵌套在我们当前的实现中不能提供最佳性能。如果您在定义（i，j）迭代空间的单个二维域上进行迭代，则算法的执行速度将更快。在k上进行串行循环将避免在更新grid3 [i，j]时发生数据争用。例如：

....
const D2 = {1..size, 1..size};
forall (i,j) in D2 do
  for k in 1..size do
    grid3[i,j] += grid[i,k] * grid2[k,j];

要分配矩阵，可以使用例如Block分配-参见我们的online docs中的示例。分发时，您当然需要注意区域设置之间的其他通信。

测试性能时，请务必使用--fast进行编译。

Answer 2

在Chapel中，forall循环不会自动在不同的语言环境中分配工作或数据（请考虑：计算节点或内存）。相反，forall循环调用与要迭代的事物相关联的并行迭代器。

因此，如果要遍历单个语言环境的局部内容，例如范围（如代码中对1..size的使用）或非分布式域或数组（例如grid在您的代码中），用于实现并行循环的所有任务都将在原始语言环境中本地执行。相反，如果您要遍历分布式域或数组（例如，Block-distributed）或调用分布式迭代器（例如，来自DistributedIters模块的迭代器），则任务将是分布式的遍历iterand定位的所有区域。

结果，任何不引用其他语言环境的Chapel程序（无论是通过子句显式地还是通过包裹子句的抽象来隐式引用，例如上述的分布式数组和迭代器）将永远不会使用除初始语言环境。

我还想提供有关分布式算法的附带说明：即使您要更新上面的程序以在多个语言环境中分布网格数组和forall循环，三重嵌套循环方法也很少是一种最佳的矩阵乘法算法。分布式存储系统，因为它不能很好地针对本地进行优化。最好研究为分布式内存设计的矩阵乘法算法（例如SUMMA）。

Answer 3

_{事后备注：：请参阅基准Vass在托管生态系统上建议的性能优势，在多语言环境中重现此效果将非常有用硅以证明其在更大规模上具有通用性...}

性能？
基准！_{... always，没有例外，没有借口-很少有惊喜-ccflags -O3}

这就是chapel如此出色的原因。非常感谢Chapel团队在过去十年中为HPC开发和改进了如此出色的计算工具。

全心全意地投入[PARALLEL]，性能始终是设计实践和底层系统硬件的结果，从来没有一个语法构造函数被授予“额外奖励”

使用TiO.run online IDE for demo-test，单个语言环境硅的结果如下：

TiO.run platform uses   1 numLocales,
               having   2 physical CPU-cores accessible (numPU-s)
                 with   2 maxTaskPar parallelism limit

For grid{1,2,3}[ 128, 128] the tested forall sum-product took       3.124 [us] incl. fillRandom()-ops
For grid{1,2,3}[ 128, 128] the tested forall sum-product took       2.183 [us] excl. fillRandom()-ops
For grid{1,2,3}[ 128, 128] the Vass-proposed sum-product took       1.925 [us] excl. fillRandom()-ops

For grid{1,2,3}[ 256, 256] the tested forall sum-product took      28.593 [us] incl. fillRandom()-ops
For grid{1,2,3}[ 256, 256] the tested forall sum-product took      25.254 [us] excl. fillRandom()-ops
For grid{1,2,3}[ 256, 256] the Vass-proposed sum-product took      21.493 [us] excl. fillRandom()-ops

For grid{1,2,3}[1024,1024] the tested forall sum-product took   2.658.560 [us] incl. fillRandom()-ops
For grid{1,2,3}[1024,1024] the tested forall sum-product took   2.604.783 [us] excl. fillRandom()-ops
For grid{1,2,3}[1024,1024] the Vass-proposed sum-product took   2.103.592 [us] excl. fillRandom()-ops

For grid{1,2,3}[2048,2048] the tested forall sum-product took  27.137.060 [us] incl. fillRandom()-ops
For grid{1,2,3}[2048,2048] the tested forall sum-product took  26.945.871 [us] excl. fillRandom()-ops
For grid{1,2,3}[2048,2048] the Vass-proposed sum-product took  25.351.754 [us] excl. fillRandom()-ops

For grid{1,2,3}[2176,2176] the tested forall sum-product took  45.561.399 [us] incl. fillRandom()-ops
For grid{1,2,3}[2176,2176] the tested forall sum-product took  45.375.282 [us] excl. fillRandom()-ops
For grid{1,2,3}[2176,2176] the Vass-proposed sum-product took  41.304.391 [us] excl. fillRandom()-ops

--fast --ccflags -O3

For grid{1,2,3}[2176,2176] the tested forall sum-product took  39.680.133 [us] incl. fillRandom()-ops
For grid{1,2,3}[2176,2176] the tested forall sum-product took  39.494.035 [us] excl. fillRandom()-ops
For grid{1,2,3}[2176,2176] the Vass-proposed sum-product took  44.611.009 [us] excl. fillRandom()-ops

结果：

在单语言环境（单个虚拟处理器，2个内核，运行所有公共时间受限（<60 [s]）分时公共工作负载online IDE）上，设备仍然不是最佳选择（Vass'与Nvidia Jetson托管的计算相比，该方法可以使 20%更快）的网格规模2048x2048可以更快地获得 ~ 4.17 x 的结果。随着内存I / O范围的扩大以及CPU-L1 / L2 / L3高速缓存的预取将会有所提高，这种性能优势似乎会随着更大的内存布局而进一步扩大（应遵循 ~ O(n^3) 扩展）进一步提高了基于CPU的NUMA平台的性能优势。

很高兴看到在多语言环境的设备和本机Cray NUMA集群平台上执行时可以实现的性能和~ O(n^3)扩展服从性。

use Random, Time, IO.FormattedIO;
var t1 : Timer;
var t2 : Timer;

t1.start(); //----------------------------------------------------------------[1]

config const size = 2048;

var grid1 : [1..size, 1..size] real;
var grid2 : [1..size, 1..size] real;
var grid3 : [1..size, 1..size] real;

fillRandom(grid1);
fillRandom(grid2);

t2.start(); //====================================[2]

forall i in 1..size {
    forall j in 1..size {
        forall k in 1..size {
            grid3[i,j] += grid1[i,k] * grid2[k,j];
        }
    }
}

t2.stop(); //=====================================[2]
t1.stop(); //-----------------------------------------------------------------[1]
             writef( "For grid{1,2,3}[%4i,%4i] the tested forall sum-product took %12i [us] incl. fillRandom()-ops\n",
                      size,
                      size,
                      t1.elapsed( TimeUnits.microseconds )
                      );
             writef( "For grid{1,2,3}[%4i,%4i] the tested forall sum-product took %12i [us] excl. fillRandom()-ops\n",
                      size,
                      size,
                      t2.elapsed( TimeUnits.microseconds )
                      );
///////////////////////////////////////////////////////////////////////////////////
t1.clear();
t2.clear();

const D3 = {1..size, 1..size, 1..size};

t2.start(); //====================================[3]

forall (i,j,k) in D3 do
  grid3[i,j] += grid1[i,k] * grid2[k,j];

t2.stop(); //=====================================[3]
             writef( "For grid{1,2,3}[%4i,%4i] the Vass-proposed sum-product took %12i [us] excl. fillRandom()-ops\n",
                      size,
                      size,
                      t2.elapsed( TimeUnits.microseconds )
                      );
//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\
//       TiO.run platform uses   1 numLocales, having   2 physical CPU-cores accessible (numPU-s) with   2 maxTaskPar parallelism limit
writef( "TiO.run platform uses %3i numLocales, having %3i physical CPU-cores accessible (numPU-s) with %3i maxTaskPar parallelism limit\n",
                      numLocales,
                      here.numPUs( false, true ),
                      here.maxTaskPar
                      );

我正在尝试使用Chapel改善矩阵乘法的运行时间

3 个答案:

性能？
基准！_{... always，没有例外，没有借口-很少有惊喜-ccflags -O3}

结果：

我正在尝试使用Chapel改善矩阵乘法的运行时间

3 个答案:

性能？ 基准！ ... always，没有例外，没有借口-很少有惊喜-ccflags -O3

结果：

性能？
基准！_{... always，没有例外，没有借口-很少有惊喜-ccflags -O3}