Question

I have a problem with the following code:

int *chosen_pts = new int[k];
std::pair<float, int> *dist2 = new std::pair<float, int>[x.n];
// initialize dist2
for (int i = 0; i < x.n; ++i) {
    dist2[i].first = std::numeric_limits<float>::max();
    dist2[i].second = i;
}

// choose the first point randomly
int ndx = 1;
chosen_pts[ndx - 1] = rand() % x.n;
double begin, end;
double elapsed_secs;
while (ndx < k) {
    float sum_distribution = 0.0;
    // look for the point that is furthest from any center
    begin = omp_get_wtime();
    #pragma omp parallel for reduction(+:sum_distribution)
    for (int i = 0; i < x.n; ++i) {

        int example = dist2[i].second;
        float d2 = 0.0, diff;
        for (int j = 0; j < x.d; ++j) {
            diff = x(example,j) - x(chosen_pts[ndx - 1],j);
            d2 += diff * diff;
        }
        if (d2 < dist2[i].first) {
            dist2[i].first = d2;
        }

        sum_distribution += dist2[i].first;

    }

    end = omp_get_wtime() - begin;

    std::cout << "center assigning -- " 
            << ndx << " of " << k << " = " 
            << (float)ndx / k * 100 
            << "% is done. Elapsed time: " << (float)end << "\n";

    /**/
    bool unique = true;

    do {
        // choose a random interval according to the new distribution
        float r = sum_distribution * (float)rand() / (float)RAND_MAX;
        float sum_cdf = dist2[0].first;
        int cdf_ndx = 0;
        while (sum_cdf < r) {
            sum_cdf += dist2[++cdf_ndx].first;
        }
        chosen_pts[ndx] = cdf_ndx;

        for (int i = 0; i < ndx; ++i) {
            unique = unique && (chosen_pts[ndx] != chosen_pts[i]);
        }
    } while (! unique);


    ++ndx;
}

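As you can see, I use omp to parallelize the for loop. It works fine and I get a significant speedup. However, if I increase x.n above 20000000, the function stops working after 8-10 iterations of the while loop:

It produces no output (std::cout)
Only one core is working
No error whatsoever

If I comment out the do-while loop, it runs as expected again. All cores are busy, there is output after every iteration, and I can increase x.n beyond 100 million as I need.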

Answer 1

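It is not the OpenMP parallel loop that gets stuck; it is evidently your serial do-while loop.

One particular problem I see is that the inner while loop that accesses dist2 has no array bounds check. In theory an out-of-bounds access should never happen, but in practice it can; see below for why. First, I would rewrite the computation of cdf_ndx to guarantee that the loop ends once all elements have been examined: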

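    float sum_cdf = 0;
    int cdf_ndx = 0;
    // stop at the first element where the running sum reaches r,
    // or at x.n if the sum never gets there
    while (cdf_ndx < x.n) {
        sum_cdf += dist2[cdf_ndx].first;
        if (sum_cdf >= r) break;
        ++cdf_ndx;
    }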

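Now, how can it happen that sum_cdf never reaches r? It comes down to the specifics of floating-point arithmetic and the fact that sum_distribution is computed in parallel while sum_cdf is computed serially. The problem is that a single element's contribution to the sum can fall below the precision of a float; in other words, when you add two float values that differ by about 8 orders of magnitude, the smaller one does not affect the sum.

So with 20M floats, after some point the next value to add may be so small compared to the accumulated sum_cdf that adding it does not change the sum at all! On the other hand, sum_distribution is essentially computed as several independent partial sums (one per thread) that are then combined. It is therefore more accurate, and may well be larger than sum_cdf.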
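To see the effect in isolation, here is a small sketch. It is not taken from the code in the question: the names serial_sum and partial_sums are made up for illustration, and it assumes float is IEEE 754 single precision and that arithmetic is actually carried out in float. The first function accumulates everything in one running float, like the sum_cdf loop; the second mimics a reduction by summing equal chunks separately and combining them at the end.

```cpp
#include <cassert>

// Add `value` to a single running float accumulator n times,
// the way the serial sum_cdf loop does.
float serial_sum(int n, float value) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        sum += value;
    }
    return sum;
}

// Add the same n values as `parts` independent partial sums that are
// combined at the end. This imitates one partial sum per thread; it is
// an illustration only, not how any particular OpenMP runtime works.
float partial_sums(int n, float value, int parts) {
    assert(parts > 0);
    const int chunk = n / parts;
    float total = 0.0f;
    for (int p = 0; p < parts; ++p) {
        const int begin = p * chunk;
        const int end = (p == parts - 1) ? n : begin + chunk;
        float part = 0.0f;
        for (int i = begin; i < end; ++i) {
            part += value;
        }
        total += part;  // each partial sum is still small enough to be exact
    }
    return total;
}
```

Under those assumptions, serial_sum(20000000, 1.0f) stalls at 16777216 (2^24), because a float has a 24-bit significand and 16777216 + 1 rounds back to 16777216. partial_sums(20000000, 1.0f, 8) returns 20000000, since every chunk of 2500000 and every intermediate total is exactly representable. A random r drawn against the larger total can therefore lie above anything the serial sum is able to reach.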

A solution could be to compute sum_cdf in parts, using two nested loops. For example:

    float sum_cdf = 0;
    int cdf_ndx = 0;
    while (sum_cdf < r && cdf_ndx < x.n ) {
        float block_sum = 0;
        int block_end = min(cdf_ndx+10000, x.n); // 10000 is an arbitrarily chosen block size
        for (int i=cdf_ndx; i<block_end; ++i ) {
            block_sum += dist2[i].first;
            if( sum_cdf+block_sum >=r ) {
                block_end = i; // adjust to correctly compute cdf_ndx
                break;
            }
        }
        sum_cdf += block_sum;
        cdf_ndx = block_end;
    }

After the loop, you need to check that cdf_ndx < x.n; otherwise, repeat with a new random interval.
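Put together as a standalone helper, the blocked search plus the bounds check might look like the sketch below. The function name pick_index, the use of std::vector<float> for the weights, and the block parameter are assumptions made for illustration; in the question the weights live in dist2[i].first.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the index of the first element at which the running sum of
// `weights` reaches r, or weights.size() if the sum never reaches r.
// The sum is accumulated block by block so that small elements are added
// to a small block_sum rather than to one huge running total.
std::size_t pick_index(const std::vector<float>& weights, float r,
                       std::size_t block = 10000) {
    assert(block > 0);
    const std::size_t n = weights.size();
    float sum_cdf = 0.0f;   // sum of all completed blocks
    std::size_t ndx = 0;    // start of the current block
    while (ndx < n) {
        const std::size_t block_end = std::min(ndx + block, n);
        float block_sum = 0.0f;
        for (std::size_t i = ndx; i < block_end; ++i) {
            block_sum += weights[i];
            if (sum_cdf + block_sum >= r) {
                return i;   // element i is where the running sum crosses r
            }
        }
        sum_cdf += block_sum;
        ndx = block_end;
    }
    return n;               // r was never reached; caller should redraw r
}
```

A caller would treat a return value equal to weights.size() as "no point selected", draw a new r, and try again, so the search can never read past the end of the array.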
