Parallel for with omp gets stuck

Time: 2015-01-05 21:32:57

Tags: c++ openmp

I have a problem with the following code:

int *chosen_pts = new int[k];
std::pair<float, int> *dist2 = new std::pair<float, int>[x.n];
// initialize dist2
for (int i = 0; i < x.n; ++i) {
    dist2[i].first = std::numeric_limits<float>::max();
    dist2[i].second = i;
}

// choose the first point randomly
int ndx = 1;
chosen_pts[ndx - 1] = rand() % x.n;
double begin, end;
double elapsed_secs;
while (ndx < k) {
    float sum_distribution = 0.0;
    // look for the point that is furthest from any center
    begin = omp_get_wtime();
    #pragma omp parallel for reduction(+:sum_distribution)
    for (int i = 0; i < x.n; ++i) {

        int example = dist2[i].second;
        float d2 = 0.0, diff;
        for (int j = 0; j < x.d; ++j) {
            diff = x(example,j) - x(chosen_pts[ndx - 1],j);
            d2 += diff * diff;
        }
        if (d2 < dist2[i].first) {
            dist2[i].first = d2;
        }

        sum_distribution += dist2[i].first;

    }

    end = omp_get_wtime() - begin;

    std::cout << "center assigning -- " 
            << ndx << " of " << k << " = " 
            << (float)ndx / k * 100 
            << "% is done. Elapsed time: "<< (float)end <<"\n";        

    /**/
    bool unique = true;

    do {
        // choose a random interval according to the new distribution
        float r = sum_distribution * (float)rand() / (float)RAND_MAX;
        float sum_cdf = dist2[0].first;
        int cdf_ndx = 0;
        while (sum_cdf < r) {
            sum_cdf += dist2[++cdf_ndx].first;
        }
        chosen_pts[ndx] = cdf_ndx;

        for (int i = 0; i < ndx; ++i) {
            unique = unique && (chosen_pts[ndx] != chosen_pts[i]);
        }
    } while (! unique);


    ++ndx;
}

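As you can see, I use omp to make the for loop parallel. It works fine and I get a significant speedup. However, if I increase the value of x.n above 20000000, the function stops working after 8-10 iterations of the outer loop:

  • It does not produce any output (std::cout)
  • Only one core is working
  • No error whatsoever

If I comment out the do-while loop, it works again as expected. All cores are busy, there is output after every iteration, and I can increase x.n to over 100 million, as I need.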

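1 answer:

Answer 0 (score: 1)

It is not the OpenMP parallel for that gets stuck; the problem is evidently in your serial do-while loop.

One particular problem I see is that the inner while loop accesses dist2 without any array bounds check. In theory, an out-of-bounds access should never happen; in practice it can, for the reasons explained below. So first, I would rewrite the computation of cdf_ndx to guarantee that the loop ends once all elements have been examined: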

    float sum_cdf = 0;
    int cdf_ndx = 0;
    while (sum_cdf < r && cdf_ndx < x.n ) {
        sum_cdf += dist2[cdf_ndx].first;
        ++cdf_ndx;
    }

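Now, how can it happen that sum_cdf does not reach r? It comes from the specifics of floating-point arithmetic, combined with the fact that sum_distribution is computed in parallel while sum_cdf is computed serially. The problem is that the contribution of a single element to the sum can fall below the precision of float (a 24-bit significand, roughly 7 decimal digits). In other words, when you add two float values that differ by more than about 8 orders of magnitude, the smaller one does not affect the sum.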

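So, with 20M floats, after some point the next value to add may be so small compared with the accumulated sum_cdf that adding it does not change the sum at all! On the other hand, sum_distribution is essentially computed as several independent partial sums (one per thread) that are then combined. It is therefore more accurate, and it can be larger than sum_cdf.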

A solution can be to compute sum_cdf in portions, with two nested loops. For example:

    float sum_cdf = 0;
    int cdf_ndx = 0;
    while (sum_cdf < r && cdf_ndx < x.n ) {
        float block_sum = 0;
        int block_end = min(cdf_ndx+10000, x.n); // 10000 is an arbitrarily selected block size
        for (int i=cdf_ndx; i<block_end; ++i ) {
            block_sum += dist2[i].first;
            if( sum_cdf+block_sum >=r ) {
                block_end = i; // adjust to correctly compute cdf_ndx
                break;
            }
        }
        sum_cdf += block_sum;
        cdf_ndx = block_end;
    }

After the loop, you need to check that cdf_ndx < x.n; otherwise, repeat with a new random interval.