Question

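I am writing a function that creates a Gaussian filter (using the Armadillo library). The filter may be either 2D or 3D, depending on the number of dimensions of the input it receives. Here is the code: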

template <class ty>
ty gaussianFilter(const ty& input, double sigma)
{
    // Our filter will be initialized to the same size as our input.
    ty filter = ty(input); // Copy constructor.

    uword nRows = filter.n_rows;
    uword nCols = filter.n_cols;
    uword nSlic = filter.n_elem / (nRows*nCols); // If 2D, nSlic == 1.

    // Offsets with respect to the middle.
    double rowOffset = static_cast<double>(nRows/2);
    double colOffset = static_cast<double>(nCols/2);
    double sliceOffset = static_cast<double>(nSlic/2);

    // Counters.
    double x = 0 , y = 0, z = 0;

    for (uword rowIndex = 0; rowIndex < nRows; rowIndex++) {
      x = static_cast<double>(rowIndex) - rowOffset;
      for (uword colIndex = 0; colIndex < nCols; colIndex++) {
        y = static_cast<double>(colIndex) - colOffset;
        for (uword sliIndex = 0; sliIndex < nSlic; sliIndex++) {
          z = static_cast<double>(sliIndex) - sliceOffset;
          // If-statement inside for-loop looks terribly inefficient
          // but the compiler should take care of this.
          if (nSlic == 1) { // If 2D, Gauss filter for 2D.
            filter(rowIndex*nCols + colIndex) = ...
          }
          else
          { // Gauss filter for 3D.
            filter((rowIndex*nCols + colIndex)*nSlic + sliIndex) = ...
          }
        }
      }
    }

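As we can see, the innermost loop contains an if statement that checks whether the size of the third dimension (nSlic) equals 1. Once it has been computed at the beginning of the function, nSlic never changes its value, so the compiler should be smart enough to optimize away the conditional branch, and I should not lose any performance.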

However... if I move the if statement out of the loop, I get a performance boost.

if (nSlic == 1)
  { // Gauss filter for 2D.
    for (uword rowIndex = 0; rowIndex < nRows; rowIndex++) {
      x = static_cast<double>(rowIndex) - rowOffset;
      for (uword colIndex = 0; colIndex < nCols; colIndex++) {
        y = static_cast<double>(colIndex) - colOffset;
        for (uword sliIndex = 0; sliIndex < nSlic; sliIndex++) {
          z = static_cast<double>(sliIndex) - sliceOffset;
          filter(rowIndex*nCols + colIndex) = ...
        }
      }
    }
  }
else
  { // Gauss filter for 3D.
    for (uword rowIndex = 0; rowIndex < nRows; rowIndex++) {
      x = static_cast<double>(rowIndex) - rowOffset;
      for (uword colIndex = 0; colIndex < nCols; colIndex++) {
        y = static_cast<double>(colIndex) - colOffset;
        for (uword sliIndex = 0; sliIndex < nSlic; sliIndex++) {
          z = static_cast<double>(sliIndex) - sliceOffset;
          filter((rowIndex*nCols + colIndex)*nSlic + sliIndex) = ...
        }
      }
    }
  }

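After compiling with g++ -O3 -c -o main.o main.cpp and measuring the execution time of both code variants, I got the following results
(1000 repetitions, 2D matrix of size 2048):

If inside the loop:

66.0453 seconds
64.7701 seconds

If outside the loop:

64.0148 seconds
63.6808 seconds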

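Why doesn't the compiler optimize away the branch if the value of nSlic never even changes? Do I really have to restructure the code to avoid an if statement inside the for loop?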

Answer 1

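Your mistake is here:

"optimize away the conditional branch, and I should not lose any performance"

Branch prediction can help you a great deal compared with actually suffering the pipeline stall associated with an unpredicted branch. But the branch is still an extra instruction in the pipeline, and it still has a cost. Processor magic reduces the cost of useless code... a lot, but not to zero.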

Answer 2

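Having an extra variable in the loop affects register usage, which can affect the timing even if branch prediction works correctly. You would need to look at the generated assembly to know. It can also affect the cache hit rate, which is hard to detect.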

Answer 3

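The interplay between compiler and hardware works like this: the compiler may be able to optimize the branch out of the loop, making the code itself optimized, but as you can see this produces a lot of code bloat, since it effectively duplicates the entire loop. Some compilers may apply this optimization by default, while others may need to be asked for it explicitly.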
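The loop duplication described above is often called loop unswitching (GCC documents a -funswitch-loops option for it). It can also be requested explicitly without writing the loop twice in the source. What follows is only a minimal sketch under stated assumptions: it uses a plain std::vector instead of an Armadillo object, and fillFilter, fillFilterImpl, weight, and the Gaussian expression are hypothetical stand-ins for the elided assignment in the question, not the asker's actual code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the elided right-hand side in the question:
// an unnormalized Gaussian weight at offset (x, y, z) from the center.
inline double weight(double x, double y, double z, double sigma) {
    return std::exp(-(x * x + y * y + z * z) / (2.0 * sigma * sigma));
}

// Is3D is a compile-time constant, so the compiler can fold the test away
// and each instantiation keeps only one of the two indexing expressions.
// (With C++17, `if constexpr (Is3D)` would state this intent explicitly.)
template <bool Is3D>
void fillFilterImpl(std::vector<double>& out, std::size_t nRows,
                    std::size_t nCols, std::size_t nSlic, double sigma) {
    const double rowOffset = static_cast<double>(nRows / 2);
    const double colOffset = static_cast<double>(nCols / 2);
    const double sliceOffset = static_cast<double>(nSlic / 2);
    for (std::size_t r = 0; r < nRows; ++r) {
        const double x = static_cast<double>(r) - rowOffset;
        for (std::size_t c = 0; c < nCols; ++c) {
            const double y = static_cast<double>(c) - colOffset;
            for (std::size_t s = 0; s < nSlic; ++s) {
                const double z = static_cast<double>(s) - sliceOffset;
                if (Is3D) {
                    out[(r * nCols + c) * nSlic + s] = weight(x, y, z, sigma);
                } else {
                    out[r * nCols + c] = weight(x, y, z, sigma);
                }
            }
        }
    }
}

// The run-time check on nSlic happens once, outside all loops.
void fillFilter(std::vector<double>& out, std::size_t nRows,
                std::size_t nCols, std::size_t nSlic, double sigma) {
    assert(out.size() == nRows * nCols * nSlic);
    if (nSlic == 1) {
        fillFilterImpl<false>(out, nRows, nCols, nSlic, sigma);
    } else {
        fillFilterImpl<true>(out, nRows, nCols, nSlic, sigma);
    }
}
```

The machine code still contains two copies of the loop, so the code-bloat trade-off is unchanged; only the source duplication goes away.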

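Alternatively, if the compiler avoids this optimization, the code keeps the branch and the hardware is left to predict it as well as it can. This involves complicated branch predictors with finite tables, so the amount of learning they can achieve is limited. In this example you don't have many competing branches (the loops, the function call and return, and the if we are discussing), but we don't see the inner workings of the function being called. It may contain more branch instructions (which would flush out what was learned outside), or it may be long enough to flush out any global history the predictor may be using. It is hard to say without seeing the code and without knowing exactly what your branch predictor does (which depends on the CPU version you use).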

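One more thing to note: this may not necessarily be related to branch prediction at all. Changing the code may change its alignment in the code cache, or in some internal loop buffer used to optimize loops (for example, this), which could cause a significant change in performance. The only way to know is to run some profiling based on hardware counters (perf, vtune, etc.) and measure the change in the number of branches and mispredictions.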
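The timings in the question differ by only a few percent, so before reaching for hardware counters it may be worth checking that the gap exceeds run-to-run noise. A minimal sketch of one way to do that, assuming a hypothetical helper bestTimeSeconds (not part of any library) that repeats a measurement and keeps the fastest run:

```cpp
#include <algorithm>
#include <chrono>
#include <limits>

// Runs `work` `repeats` times and returns the fastest wall-clock time in
// seconds. The minimum tends to be less affected by interference from
// other processes than a single measurement is.
template <class F>
double bestTimeSeconds(F&& work, int repeats) {
    double best = std::numeric_limits<double>::infinity();
    for (int i = 0; i < repeats; ++i) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        const double elapsed =
            std::chrono::duration<double>(stop - start).count();
        best = std::min(best, elapsed);
    }
    return best;
}
```

Each code variant would be passed in as a lambda, for example bestTimeSeconds([&] { gaussianFilter(input, sigma); }, 10), where input and sigma are whatever the benchmark already uses. If the gap survives this, a counter-based run such as perf stat -e branches,branch-misses on the compiled binary can show whether mispredictions actually differ.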

Why am I not a victim of branch prediction?
