OpenMP on a dual-socket system

Question

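I do some scientific computations in C++ and try to use OpenMP to parallelize some of the loops. This has worked well so far, for example on an Intel i7-4770 with 8 threads.

Setup

We have a small workstation consisting of two Intel CPUs (E5-2680v2) on one mainboard. The code works as long as it runs on just 1 CPU, with as many threads as I like. But as soon as I use the second CPU, I observe incorrect results from time to time (roughly once in every 50 to 100 runs of the code). This happens even when I use only 2 threads and assign them to the two different CPUs. Since we have 5 of these workstations (all identical), I ran the code on each of them, and all of them show the problem.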
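To confirm that threads really end up on both sockets, one could sample the logical CPU each OpenMP thread runs on. The following is only a sketch under assumptions: `SampleThreadCpus` is a hypothetical helper, `sched_getcpu()` is a glibc extension available on Linux, and how logical CPU numbers map to sockets depends on the machine (the post does not say how the threads were pinned).

```cpp
#include <cassert>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif
#ifdef __linux__
#include <sched.h>   // sched_getcpu(), glibc extension
#endif

// Hypothetical helper: for every OpenMP thread, returns the logical CPU it
// was running on when sampled, or -1 if that information is unavailable.
// Without OpenMP support the pragma is ignored and one entry is returned.
std::vector<int> SampleThreadCpus()
{
    int numThreads = 1;
#ifdef _OPENMP
    numThreads = omp_get_max_threads();
#endif
    assert(numThreads >= 1);
    std::vector<int> cpus(numThreads, -1);

    #pragma omp parallel
    {
        int tid = 0;
#ifdef _OPENMP
        tid = omp_get_thread_num();
#endif
        int cpu = -1;
#ifdef __linux__
        cpu = sched_getcpu();
#endif
        // Each thread writes only its own slot, so no synchronization is needed.
        if (tid < numThreads)
            cpus[tid] = cpu;
    }
    return cpus;
}
```

Printing the returned values next to the socket layout of the machine would show whether a failing run actually spanned both CPUs. Unless the threads are pinned, the scheduler may move them later, so the sample is only a snapshot.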

The workstations run OpenSuse 13.1, kernel 3.11.10-7. The problem exists with g++ 4.8.1 and 4.9.0, and with Intel's icc 13.1.3.192 (the problem occurs less often with icc, but it is still there).

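Symptoms

The symptoms can be described as follows:

I have a large array of std::complex: std::complex<double>* mFourierValues;
In the loop, I access and set each element. Each iteration accesses a different element, so there are no concurrent accesses (I checked this): mFourierValues[idx] = newValue;
If I afterwards compare the array value that was set with the input value, roughly mFourierValues[idx] == newValue, this check fails from time to time (although not every time the results end up being incorrect).

So the symptom looks like elements being accessed concurrently without any synchronization. However, when I store the indices in a std::vector (with a proper #pragma omp critical), all indices are unique and in the correct range.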
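The index check described above could look roughly like the following. This is a minimal sketch, not the code from the post: `RecordTouchedIndices` and `AllUniqueAndInRange` are hypothetical names, and the identity mapping stands in for the real index computation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: runs a parallel loop and records the index that each
// iteration would write to. Without OpenMP support the pragmas are ignored
// and the loop simply runs serially.
std::vector<std::size_t> RecordTouchedIndices(int n)
{
    assert(n >= 0);
    std::vector<std::size_t> touched;
    touched.reserve(static_cast<std::size_t>(n));

    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
    {
        // Stand-in for the real index computation (assumption).
        const std::size_t idx = static_cast<std::size_t>(i);

        // push_back is not thread-safe, so it is serialized.
        #pragma omp critical
        touched.push_back(idx);
    }
    return touched;
}

// True if no index occurs twice and every index is below 'limit'.
bool AllUniqueAndInRange(std::vector<std::size_t> indices, std::size_t limit)
{
    std::sort(indices.begin(), indices.end());
    const bool unique =
        std::adjacent_find(indices.begin(), indices.end()) == indices.end();
    const bool inRange = indices.empty() || indices.back() < limit;
    return unique && inRange;
}
```

If such a check passes while the stored values are still wrong, overlapping writes from the loop itself become an unlikely explanation, which is the reasoning the post follows.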

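Problem

After several days of debugging, I suspect that something else is going on and that my code is correct. To me it looks like something weird happens when the CPUs synchronize their caches with main memory.

Therefore, my questions are:

Can OpenMP even be used on such a system? (I have not found a source that says no.)
Are there known bugs for such a situation? (I have not found any in the bug trackers.)
Where do you think the problem is located?
- My code (which seems to run fine on 1 CPU with multiple cores!),
- The compilers (both gcc and icc!),
- The OS,
- The hardware (a defect on all 5 workstations?)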

Code

[Edit: removed old code, see below]

Edit with minimal example

OK, I was finally able to produce a shorter (and self-contained) code example.

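About the code

Some memory is reserved. For an array on the stack, it would be accessed like this: complex<double> mAllElements[tensorIdx][kappa1][kappa2][kappa3]. That is, I have 3 rank-3 tensors (tensorIdx). Each tensor represents a 3-dimensional array, indexed by kappa1, kappa2 and kappa3.
I have 4 nested loops (over all 4 indices); the kappa1 loop is the one that is parallelized (and is the outermost one). They are located in DoComputation().
In main(), I call DoComputation() once to get some reference values, and then I call it several times and compare the results. They should match exactly, but sometimes they do not.

Unfortunately, the code is still about 190 lines long. I tried to simplify it further (only 1 tensor of rank 1, etc.), but then I was never able to reproduce the problem. I guess it shows up because the memory accesses are non-contiguous (the loop over tensorIdx is the innermost one). (I know, this is far from optimal.)

Moreover, some delays are needed in the right places to reproduce the bug. That is the reason for the nops() calls. Without them the code runs a lot faster, but so far it has not shown the problem.

Note that I checked the critical part, CalcElementIdx(), again and consider it correct (each element is accessed exactly once). I also ran valgrind's memcheck, helgrind and drd (with a suitably recompiled libgomp), and none of the three reported an error.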
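One way to double-check the claim that every element is accessed exactly once is to count, serially, how often each flat index is produced. The sketch below reuses the index formula from CalcElementIdx() in the listing; `FlatIndex` and `EveryElementVisitedOnce` are hypothetical names introduced only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Same formula as CalcElementIdx() in the listing, with the grid size
// passed in as a parameter instead of using the GRID_SIZE constant.
std::size_t FlatIndex(int tensorIdx, int kappa1, int kappa2, int kappa3,
                      int gridSize)
{
    const std::size_t g = static_cast<std::size_t>(gridSize);
    const std::size_t perTensor = g * g * g;
    const std::size_t inTensor =
        static_cast<std::size_t>(kappa3) +
        g * (static_cast<std::size_t>(kappa2) +
             g * static_cast<std::size_t>(kappa1));
    return perTensor * static_cast<std::size_t>(tensorIdx) + inTensor;
}

// Walks all (tensorIdx, kappa1, kappa2, kappa3) combinations and checks that
// every flat index in [0, total) is produced exactly once.
bool EveryElementVisitedOnce(int numTensors, int gridSize)
{
    assert(numTensors > 0 && gridSize > 0);
    const std::size_t g = static_cast<std::size_t>(gridSize);
    const std::size_t total = static_cast<std::size_t>(numTensors) * g * g * g;
    std::vector<int> hits(total, 0);

    for (int k1 = 0; k1 < gridSize; ++k1)
        for (int k2 = 0; k2 < gridSize; ++k2)
            for (int k3 = 0; k3 < gridSize; ++k3)
                for (int t = 0; t < numTensors; ++t)
                {
                    const std::size_t idx = FlatIndex(t, k1, k2, k3, gridSize);
                    if (idx >= total)
                        return false;
                    ++hits[idx];
                }

    for (std::size_t i = 0; i < total; ++i)
        if (hits[i] != 1)
            return false;
    return true;
}
```

Calling EveryElementVisitedOnce(3, 60) corresponds to the sizes used in the post. Because different kappa1 values map to disjoint index ranges, a passing check also means that the parallelized outer loop does not produce overlapping writes.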

Output

On every second or third start of the program I get one or two mismatches. Example output:

41      Is exactly 0
42      Is exactly 0
43      Is exactly 0
44      Is exactly 0
45      348496
46      Is exactly 0
47      Is exactly 0
48      Is exactly 0
49      Is exactly 0

This is the case for both gcc and icc.
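The comparison in the listing takes the maximum of the signed real and imaginary differences, so a run whose only deviations are negative would, as far as I can tell, still be reported as "Is exactly 0". A stricter comparison could use absolute values and count mismatching elements. The sketch below assumes the data is held in std::vector; `MaxAbsDiff` and `CountMismatches` are hypothetical helpers, not part of the original code.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

typedef std::vector<std::complex<double> > ComplexArray;

// Largest absolute component-wise difference between two arrays.
double MaxAbsDiff(const ComplexArray& a, const ComplexArray& b)
{
    assert(a.size() == b.size());
    double maxDiff = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
    {
        const std::complex<double> d = a[i] - b[i];
        maxDiff = std::max(maxDiff,
                           std::max(std::abs(d.real()), std::abs(d.imag())));
    }
    return maxDiff;
}

// Number of elements that are not exactly equal.
std::size_t CountMismatches(const ComplexArray& a, const ComplexArray& b)
{
    assert(a.size() == b.size());
    std::size_t count = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        if (a[i] != b[i])
            ++count;
    return count;
}
```

Reporting the mismatch count alongside the maximum difference would also show whether a failing run corrupted a single element or a whole block.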

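My question

My question is: does the code below look correct to you? (Apart from obvious design flaws.) (If it is too long, I will try to reduce it further, but as described above I have failed so far.)

Code

The code is compiled with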

g++ main.cc -O3 -Wall -Wextra -fopenmp

or

icc main.cc -O3 -Wall -Wextra -openmp

Both versions show the described problem when run on 2 CPUs with a total of 40 threads. I could not observe the bug on 1 CPU (with as many threads as I like).

// File: main.cc
#include <cmath>
#include <iostream>
#include <fstream>
#include <complex>
#include <cassert>
#include <iomanip>
#include <cstdio>   // printf, used in DoComputation()
#include <omp.h>

using namespace std;


// If defined: We add some nops in certain places, to get the timing right.
// Without them, I haven't observed the bug.
#define ENABLE_NOPS

// The size of each of the 3 tensors is: GRID_SIZE x GRID_SIZE x GRID_SIZE
static const int GRID_SIZE = 60;

//=============================================
// Produces several nops. Used to get correct "timings".

//----
template<int N> __attribute__((always_inline)) inline void nop()
{
    nop<N-1>();
    asm("nop;");
}

//----
template<> inline void nop<0>() { }

//----
__attribute__((always_inline)) inline void nops()
{
    nop<500>(); nop<500>(); nop<500>(); nop<500>(); nop<500>(); nop<500>(); nop<500>(); nop<500>(); nop<500>();
}




//=============================================
/*
Memory layout: We have 3 rank-3-tensors, i.e. 3 arrays of dimension 3.
The layout looks like this: complex<double> allElements[tensorIdx][kappa1][kappa2][kappa3];
The kappas represent the indices into a certain tensor, and are all in the interval [0; GRID_SIZE-1].
*/
class MemoryManagerFFTW
{
public:
  //---------- Constructor ----------
  MemoryManagerFFTW()
  {
    mAllElements = new complex<double>[GetTotalNumElements()];
  }

  //---------- Destructor ----------
  ~MemoryManagerFFTW() 
  { 
    delete[] mAllElements; 
  }

  //---------- SetElement ----------
  void SetElement(int tensorIdx, int kappa1, int kappa2, int kappa3, const complex<double>& newVal)
  {
    // Out-of-bounds error checks are done in this function.
    const size_t idx = CalcElementIdx(tensorIdx, kappa1, kappa2, kappa3);

    // These nops here are important to reproduce the bug.
#if defined(ENABLE_NOPS)
    nops();
    nops();
#endif

    // A flush makes the bug appear more often.
    // #pragma omp flush
    mAllElements[idx] = newVal;

    // This was never false, although the same check is false in DoComputation() from time to time.
    assert(newVal == mAllElements[idx]);
  }

  //---------- GetElement ----------
  const complex<double>& GetElement(int tensorIdx, int kappa1, int kappa2, int kappa3)const
  {  
    const size_t idx = CalcElementIdx(tensorIdx, kappa1, kappa2, kappa3);
    return mAllElements[idx];
  }


  //---------- CalcElementIdx ----------
  size_t CalcElementIdx(int tensorIdx, int kappa1, int kappa2, int kappa3)const
  {
    // We have 3 tensors (index by "tensorIdx"). Each tensor is of rank 3. In memory, they are placed behind each other.
    // tensorStartIdx is the index of the first element in the tensor.
    const size_t tensorStartIdx = GetNumElementsPerTensor() * tensorIdx;

    // Index of the element relative to the beginning of the tensor. A tensor is a 3dim. array of size GRID_SIZE x GRID_SIZE x GRID_SIZE
    const size_t idxInTensor = kappa3 + GRID_SIZE * (kappa2 + GRID_SIZE * kappa1);

    const size_t finalIdx = tensorStartIdx + idxInTensor;
    assert(finalIdx < GetTotalNumElements());

    return finalIdx;
  }


  //---------- GetNumElementsPerTensor & GetTotalNumElements ----------
  size_t GetNumElementsPerTensor()const { return GRID_SIZE * GRID_SIZE * GRID_SIZE; }
  size_t GetTotalNumElements()const { return NUM_TENSORS * GetNumElementsPerTensor(); }



public:
  static const int NUM_TENSORS = 3; // The number of tensors.
  complex<double>* mAllElements; // All tensors. An array [tensorIdx][kappa1][kappa2][kappa3]
};




//=============================================
void DoComputation(MemoryManagerFFTW& mSingleLayerManager)
{
  // Parallelize the outer loop.
  #pragma omp parallel for
  for (int kappa1 = 0; kappa1 < GRID_SIZE; ++kappa1)
  {
    for (int kappa2 = 0; kappa2 < GRID_SIZE; ++kappa2)
    {
      for (int kappa3 = 0; kappa3 < GRID_SIZE; ++kappa3)
      {    
#ifdef ENABLE_NOPS
        nop<50>();
#endif
        const double k2 = kappa1*kappa1 + kappa2*kappa2 + kappa3*kappa3;
        for (int j = 0; j < 3; ++j)
        {
          // Compute and set new result.
          const complex<double> curElement = mSingleLayerManager.GetElement(j, kappa1, kappa2, kappa3);
          const complex<double> newElement = exp(-k2) * k2 * curElement;

          mSingleLayerManager.SetElement(j, kappa1, kappa2, kappa3, newElement);

          // Check if the result has been set correctly. This is sometimes false, but _not_ always when the result is incorrect.
          const complex<double> test = mSingleLayerManager.GetElement(j, kappa1, kappa2, kappa3);
          if (test != newElement)
            printf("Failure: (%g, %g) != (%g, %g)\n", test.real(), test.imag(), newElement.real(), newElement.imag());
        }
      }
    }
  }
}



//=============================================
int main()
{
  cout << "Max num. threads: " << omp_get_max_threads() << endl;

  // Call DoComputation() once to get a reference-array.
  MemoryManagerFFTW reference;
  for (size_t i = 0; i < reference.GetTotalNumElements(); ++i)
    reference.mAllElements[i] = complex<double>((double)i, (double)i+0.5);
  DoComputation(reference);

  // Call DoComputation() several times, and each time compare the result to the reference.
  const size_t NUM = 1000;
  for (size_t curTry = 0; curTry < NUM; ++curTry)
  {
    MemoryManagerFFTW mSingleLayerManager;
    for (size_t i = 0; i < mSingleLayerManager.GetTotalNumElements(); ++i)
      mSingleLayerManager.mAllElements[i] = complex<double>((double)i, (double)i+0.5);
    DoComputation(mSingleLayerManager);

    // Get the max. difference. This *should* be 0, but isn't from time to time.
    double maxDiff = -1;
    for (size_t i = 0; i < mSingleLayerManager.GetTotalNumElements(); ++i)
    {
      const complex<double> curDiff = mSingleLayerManager.mAllElements[i] - reference.mAllElements[i];
      maxDiff = max(maxDiff, max(curDiff.real(), curDiff.imag()));
    }

    if (maxDiff != 0)
      cout << curTry << "\t" << maxDiff << endl;
    else
      cout << curTry << "\t" << "Is exactly 0" << endl;
  }

  return 0;
}

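Edit

As the comments below and the answer by Zboson suggest, there was a bug in kernel 3.11.10-7. After updating to 3.15.0-1 my problem is gone, and the code works as it should.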

Answer 1

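The problem was due to a bug in Linux kernel 3.11.10-7. As Hristo Iliev pointed out, the bug may be due to how the kernel handles invalidating the TLB cache. I guessed that the kernel might be the problem because I had read that Linux kernel 3.15 would bring some improvements for NUMA systems, so I figured that the kernel version matters on NUMA systems.

When the OP updated the Linux kernel of his NUMA system to 3.15.0-1, the problem went away.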
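Since the outcome hinged on the kernel version, a program could log the running kernel release and flag releases older than the one that worked here. The following is a sketch under assumptions: the helper names are hypothetical, and the 3.15 threshold only reflects what was reported in this thread, not an authoritative list of fixed kernels.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <utility>
#include <sys/utsname.h>   // POSIX uname()

// Parses the leading "major.minor" of a release string such as "3.11.10-7".
// Returns (-1, -1) if the string does not start with two dotted numbers.
std::pair<int, int> ParseKernelVersion(const std::string& release)
{
    int verMajor = -1;
    int verMinor = -1;
    if (std::sscanf(release.c_str(), "%d.%d", &verMajor, &verMinor) != 2)
        return std::make_pair(-1, -1);
    return std::make_pair(verMajor, verMinor);
}

// True if 'version' is at least wantMajor.wantMinor.
bool IsAtLeast(const std::pair<int, int>& version, int wantMajor, int wantMinor)
{
    return version.first > wantMajor ||
           (version.first == wantMajor && version.second >= wantMinor);
}

// Release string of the running kernel, or an empty string on failure.
std::string RunningKernelRelease()
{
    struct utsname info;
    if (uname(&info) < 0)
        return std::string();
    return std::string(info.release);
}
```

For example, IsAtLeast(ParseKernelVersion(RunningKernelRelease()), 3, 15) could gate a warning at program start on multi-socket machines.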
