Question

According to Chapel's available documentation, (whole-)array statements such as

A = B + alpha * C;   // with A, B, and C being arrays, and alpha some scalar

are implemented in the language in terms of the following forall iteration:

forall (a,b,c) in zip(A,B,C) do
   a = b + alpha * c;
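
A minimal sketch of this equivalence, assuming small 1-D arrays and an arbitrary value of alpha (the names A1, A2 and the config constant n are illustrative and do not come from the documentation):

```chapel
config const n = 8;            // illustrative array size (assumption)
const alpha = 2.0;             // arbitrary scalar (assumption)

var B, C: [1..n] real;
B = 1.0;
C = 3.0;

var A1, A2: [1..n] real;

// whole-array statement
A1 = B + alpha * C;

// the explicit zippered forall loop it is documented to correspond to
forall (a, b, c) in zip(A2, B, C) do
  a = b + alpha * c;

// prints whether both forms produced the same values
writeln(&& reduce (A1 == A2));
```

Both forms compute the same element-wise result; the sketch says nothing about how either one is scheduled or vectorized.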

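Thus, array statements are by default executed by a team of parallel threads. Unfortunately, this also seems to completely preclude the (partial or complete) vectorization of such statements. This can lead to performance surprises for programmers who are used to languages like Fortran or Python/Numpy, where the default behavior is typically for array statements to be vectorized only.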

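For codes that use (whole-)array statements on arrays of small to moderate size, the loss of vectorization (confirmed by Linux hardware performance counters) and the significant overhead inherent in parallel threads (which are ill-suited to exploiting the fine-grained data parallelism available in such problems) can result in a significant loss of performance. As an example, consider the following versions of Jacobi iteration, all of which solve the same problem on a domain of 300 x 300 zones: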

Jacobi_1 employs array statements, as follows:

/*
 *  Jacobi_1
 *
 *  This program (adapted from the Chapel distribution) performs
 *  niter iterations of the Jacobi method for the Laplace equation
 *  using (whole-)array statements.
 *
 */

config var n = 300;                  // size of n x n grid
config var niter = 10000;            // number of iterations to perform

proc main() {

  const Domain = {0..n+1,0..n+1};    // domain including boundary points

  var iteration = 0;                 // iteration counter
  var X, XNew: [Domain] real = 0.0;  // declare arrays: 
                                     //   X stores approximate solution
                                     //   XNew stores the next solution  
  X[n+1,1..n] = 1.0;                 // Set south boundary values to 1.0

  do {

    // compute next approximation
    XNew[1..n,1..n] =
      ( X[0..n-1,1..n] + X[2..n+1,1..n] +
        X[1..n,2..n+1] + X[1..n,0..n-1] ) / 4.0;

    // update X with next approximation
    X[1..n,1..n] = XNew[1..n,1..n];

    // advance iteration counter
    iteration += 1;

  } while (iteration < niter);

  writeln("Jacobi computation complete.");
  writeln("# of iterations: ", iteration);

} // main

Jacobi_2 employs serial for-loops throughout (i.e., only (auto-)vectorization by the back-end C compiler is allowed):

/*
 *  Jacobi_2
 *
 *  This program (adapted from the Chapel distribution) performs
 *  niter iterations of the Jacobi method for the Laplace equation
 *  using (serial) for-loops.
 *
 */

config var n = 300;                  // size of n x n grid
config var niter = 10000;            // number of iterations to perform

proc main() {

  const Domain = {0..n+1,0..n+1};    // domain including boundary points

  var iteration = 0;                 // iteration counter
  var X, XNew: [Domain] real = 0.0;  // declare arrays: 
                                     //   X stores approximate solution
                                     //   XNew stores the next solution  
  for j in 1..n do
    X[n+1,j] = 1.0;                  // Set south boundary values to 1.0

  do {

    // compute next approximation
    for i in 1..n do
      for j in 1..n do  
        XNew[i,j] = ( X[i-1,j] + X[i+1,j] +
                      X[i,j+1] + X[i,j-1] ) / 4.0;

    // update X with next approximation
    for i in 1..n do
      for j in 1..n do
        X[i,j] = XNew[i,j];

    // advance iteration counter
    iteration += 1;

  } while (iteration < niter);

  writeln("Jacobi computation complete.");
  writeln("# of iterations: ", iteration);

} // main

Finally, Jacobi_3 has its innermost loops vectorized and only the outermost loops threaded:

/*
 *  Jacobi_3
 *
 *  This program (adapted from the Chapel distribution) performs
 *  niter iterations of the Jacobi method for the Laplace equation
 *  using both parallel and serial (vectorized) loops.
 *
 */

config var n = 300;                  // size of n x n grid
config var niter = 10000;            // number of iterations to perform

proc main() {

  const Domain = {0..n+1,0..n+1};    // domain including boundary points

  var iteration = 0;                 // iteration counter
  var X, XNew: [Domain] real = 0.0;  // declare arrays: 
                                     //   X stores approximate solution
                                     //   XNew stores the next solution  
  for j in vectorizeOnly(1..n) do
    X[n+1,j] = 1.0;                  // Set south boundary values to 1.0

  do {

    // compute next approximation
    forall i in 1..n do
      for j in vectorizeOnly(1..n) do
        XNew[i,j] = ( X[i-1,j] + X[i+1,j] +
                      X[i,j+1] + X[i,j-1] ) / 4.0;

    // update X with next approximation
    forall i in 1..n do
      for j in vectorizeOnly(1..n) do
        X[i,j] = XNew[i,j];

    // advance iteration counter
    iteration += 1;

  } while (iteration < niter);

  writeln("Jacobi computation complete.");
  writeln("# of iterations: ", iteration);

} // main

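Running these codes on a laptop with 2 processor cores, using two parallel threads, one finds that Jacobi_1 is (surprisingly) more than ten times slower than Jacobi_2, which itself is (as expected) about 1.6 times slower than Jacobi_3.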

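Unfortunately, this default behavior makes array statements completely unattractive for my use cases, even for algorithms that would benefit greatly from the more concise notation and readability that (whole-)array statements can provide.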

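Are there ways for the user in Chapel to change this default behavior? That is, can a user customize the default parallelization of whole-array statements so that array statements such as those used in Jacobi_1 behave either like the code in Jacobi_2 (which would be useful for code development and debugging purposes) or like the code in Jacobi_3 (which, among the three, would be the method of choice for production calculations)?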

I have tried to achieve this by plugging calls to vectorizeOnly() into the definition of "Domain" above, but to no avail.

Answer 1

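Chapel's intent is to support vectorization automatically within the per-task serial loops that are used to implement forall loops (for cases that are legally vectorizable). Yet, as you note, that capability is not well supported today (even the vectorizeOnly() iterator that you're using is only considered prototypical).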

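I'll mention that we tend to see better vectorization results when using Chapel's LLVM back-end than we do with the (default) C back-end, and that we've seen even better results when using Simon Moll's LLVM-based Region Vectorizer (Saarland University). But we've also seen cases where the LLVM back-end underperforms the C back-end, so your mileage may vary. Still, if you care about vectorization, it's worth a try.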

To your specific question:

Are there ways for the user in Chapel to change this default behavior?

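There are. For explicit forall loops, you can write your own parallel iterator to specify a different implementation strategy for a forall loop than our default iterators use. If you implement one that you like, you can then write (or clone and modify) a domain map to govern how loops over a given array are implemented by default (i.e., if no iterator is explicitly invoked). This permits end users to specify different implementation policies for a Chapel array than the ones we support by default.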
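
As a rough illustration of the first option, a user-defined standalone parallel iterator might look like the sketch below. The name rowChunks, the block-chunking policy, and the small test array are hypothetical choices made for illustration; this is not Chapel's default iterator implementation, and writing a full domain map is considerably more involved.

```chapel
// Serial overload: lets the iterator be used in ordinary for-loops.
iter rowChunks(r: range) {
  for i in r do
    yield i;
}

// Standalone parallel overload: used by `forall i in rowChunks(...)`.
// Policy (an assumption of this sketch): one task per core, each task
// running a plain serial loop over one contiguous block of indices.
iter rowChunks(param tag: iterKind, r: range)
  where tag == iterKind.standalone {
  const numTasks = min(here.maxTaskPar, r.size);
  coforall tid in 0..#numTasks {
    const lo = r.low + (tid * r.size) / numTasks;
    const hi = r.low + ((tid + 1) * r.size) / numTasks - 1;
    for i in lo..hi do
      yield i;
  }
}

config const n = 8;             // illustrative size (assumption)
var A: [1..n, 1..n] real;

// rows are distributed across tasks; the inner loop stays serial
forall i in rowChunks(1..n) do
  for j in 1..n do
    A[i, j] = i + j;

writeln(+ reduce A);
```

The design point is that the body of the forall loop does not change; only the iterator decides how many tasks are created and which indices each task handles.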

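With respect to your three code variants, I'll note that the first uses multidimensional zippering, which is known to have significant performance problems today. This is likely the main cause of the performance difference between it and the others. For example, I suspect that if you rewrote it using the form forall (i,j) in Domain ... and then used +/-1 indexing per dimension, you'd see a significant improvement (and, I'd guess, performance much more comparable to the third case).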
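
A sketch of what that rewrite of Jacobi_1 could look like follows. The domain name Inner is an assumption introduced here for the interior points, and no timing is claimed for this version; it only shows the suggested loop form.

```chapel
config var n = 300;                  // size of n x n grid
config var niter = 10000;            // number of iterations to perform

proc main() {

  const Domain = {0..n+1, 0..n+1};   // domain including boundary points
  const Inner  = {1..n, 1..n};       // interior points (name is an assumption)

  var X, XNew: [Domain] real = 0.0;  // current and next approximation
  X[n+1, 1..n] = 1.0;                // set south boundary values to 1.0

  for iteration in 1..niter {

    // compute next approximation with explicit +/- 1 indexing
    forall (i, j) in Inner do
      XNew[i, j] = ( X[i-1, j] + X[i+1, j] +
                     X[i, j+1] + X[i, j-1] ) / 4.0;

    // update X with next approximation
    forall (i, j) in Inner do
      X[i, j] = XNew[i, j];
  }

  writeln("Jacobi computation complete.");
  writeln("# of iterations: ", niter);

} // main
```

This form iterates over a single domain instead of zippering several array slices, which is the difference the paragraph above points to.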

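For the third, I'd be curious whether the benefit you're seeing is due to vectorization or simply to multitasking, since you've avoided the performance problems of the first and second implementations. For example, have you checked whether using the vectorizeOnly() iterator gives any performance improvement over the same code without that iterator (or used tools on the binary to inspect whether vectorization is occurring)?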

In any Chapel performance study, make sure to throw the --fast compiler flag. And again, for the best vectorization results, you might try the LLVM back-end.
