I am having a problem with the following code:
    int *chosen_pts = new int[k];
    std::pair<float, int> *dist2 = new std::pair<float, int>[x.n];
    // initialize dist2
    for (int i = 0; i < x.n; ++i) {
        dist2[i].first = std::numeric_limits<float>::max();
        dist2[i].second = i;
    }
    // choose the first point randomly
    int ndx = 1;
    chosen_pts[ndx - 1] = rand() % x.n;
    double begin, end;
    double elapsed_secs;
    while (ndx < k) {
        float sum_distribution = 0.0;
        // look for the point that is furthest from any center
        begin = omp_get_wtime();
        #pragma omp parallel for reduction(+:sum_distribution)
        for (int i = 0; i < x.n; ++i) {
            int example = dist2[i].second;
            float d2 = 0.0, diff;
            for (int j = 0; j < x.d; ++j) {
                diff = x(example,j) - x(chosen_pts[ndx - 1],j);
                d2 += diff * diff;
            }
            if (d2 < dist2[i].first) {
                dist2[i].first = d2;
            }
            sum_distribution += dist2[i].first;
        }
        end = omp_get_wtime() - begin;
        std::cout << "center assigning -- "
                  << ndx << " of " << k << " = "
                  << (float)ndx / k * 100
                  << "% is done. Elasped time: "<< (float)end <<"\n";
        /**/
        bool unique = true;
        do {
            // choose a random interval according to the new distribution
            float r = sum_distribution * (float)rand() / (float)RAND_MAX;
            float sum_cdf = dist2[0].first;
            int cdf_ndx = 0;
            while (sum_cdf < r) {
                sum_cdf += dist2[++cdf_ndx].first;
            }
            chosen_pts[ndx] = cdf_ndx;
            for (int i = 0; i < ndx; ++i) {
                unique = unique && (chosen_pts[ndx] != chosen_pts[i]);
            }
        } while (! unique);
        ++ndx;
    }
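As you can see, I use OpenMP to parallelize the for loop. This works fine and gives a significant speedup. However, if I increase the value of x.n above 20000000, the function stops working after 8-10 iterations of the loop.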
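If I comment out the do-while loop, it runs as expected again: all cores are busy, there is output after every iteration, and I can increase x.n beyond 100 million as needed.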
Answer 0 (score: 1)
It is not the OpenMP parallel loop that gets stuck; the hang is evidently in your serial do-while loop.
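One particular problem I see is that the inner while loop, which accesses dist2, has no array bounds check. In theory an out-of-bounds access should never happen, but in practice it can, for the reason explained below. First, I would rewrite the computation of cdf_ndx so that the loop is guaranteed to end once all elements have been examined: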
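    float sum_cdf = 0;
    int cdf_ndx = 0;
    // Stop when the running sum reaches r or when all x.n elements are used up.
    while (cdf_ndx < x.n) {
        sum_cdf += dist2[cdf_ndx].first;
        if (sum_cdf >= r) break;  // cdf_ndx is the selected element
        ++cdf_ndx;
    }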
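Now, how can sum_cdf fail to reach r? It comes down to the specifics of floating-point arithmetic and to the fact that sum_distribution is computed in parallel while sum_cdf is computed serially. The problem is that the contribution of a single element to the sum can fall below the precision of float. A float carries only about 7 significant decimal digits (a 24-bit significand), so when you add two float values that differ by roughly 8 orders of magnitude, the smaller one does not affect the sum.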
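The absorption effect described above is easy to reproduce in isolation. The following is a minimal sketch, assuming IEEE-754 single-precision float; the helper name is_absorbed is hypothetical and not part of the original code.

```cpp
// Hypothetical helper (not from the question): reports whether adding
// `small` to `big` leaves `big` unchanged in single-precision arithmetic.
// Assumes float is IEEE-754 binary32 (24-bit significand).
bool is_absorbed(float big, float small) {
    // volatile forces the result to be rounded to float before comparing,
    // even on platforms that evaluate expressions in higher precision.
    volatile float sum = big + small;
    return sum == big;
}
```

For instance, is_absorbed(16777216.0f, 1.0f) holds because 2^24 + 1 is not representable as a float, whereas is_absorbed(1000.0f, 1.0f) does not.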
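So with 20M floats, at some point the next value to be added can be so small compared to the accumulated sum_cdf that adding it does not change the sum at all! On the other hand, sum_distribution is essentially computed as several independent partial sums (one per thread) that are then combined. It is therefore more accurate, and it can end up larger than sum_cdf.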
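The gap between the two summation orders can be illustrated without OpenMP. The sketch below is only an illustration under stated assumptions: the function names serial_sum and chunked_sum are hypothetical, the chunks stand in for the per-thread partial sums of the reduction, and float arithmetic is assumed to be evaluated in IEEE-754 single precision without fast-math style reordering.

```cpp
// Hypothetical illustration: adds `value` n times into one float accumulator,
// the way the serial sum_cdf loop accumulates.
float serial_sum(float value, long long n) {
    float s = 0.0f;
    for (long long i = 0; i < n; ++i) {
        s += value;
    }
    return s;
}

// Hypothetical illustration: splits the same n additions into `chunks`
// partial sums and then combines them, mimicking a per-thread reduction.
float chunked_sum(float value, long long n, int chunks) {
    float total = 0.0f;
    for (int c = 0; c < chunks; ++c) {
        const long long lo = n * c / chunks;
        const long long hi = n * (c + 1) / chunks;
        float part = 0.0f;
        for (long long i = lo; i < hi; ++i) {
            part += value;
        }
        total += part;  // each partial sum stays small enough to be exact here
    }
    return total;
}
```

Under those assumptions, summing 20,000,000 copies of 1.0f serially stalls at 16777216 (2^24), while eight partial sums of 2,500,000 each combine to exactly 20000000. A random r drawn between those two values can never be reached by the serial sum.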
One solution is to compute sum_cdf in parts, using two nested loops. For example:
    // requires #include <algorithm> for std::min
    float sum_cdf = 0;
    int cdf_ndx = 0;
    while (sum_cdf < r && cdf_ndx < x.n) {
        float block_sum = 0;
        int block_end = std::min(cdf_ndx + 10000, x.n); // 10000 is an arbitrarily selected block size
        for (int i = cdf_ndx; i < block_end; ++i) {
            block_sum += dist2[i].first;
            if (sum_cdf + block_sum >= r) {
                block_end = i; // adjust to correctly compute cdf_ndx
                break;
            }
        }
        sum_cdf += block_sum;
        cdf_ndx = block_end;
    }
After the loop, you need to check that cdf_ndx < x.n; if it is not, repeat with a new random interval.
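The check-and-retry step could be packaged roughly as follows. This is a sketch under assumptions, not the original poster's code: pick_index, sample_index, the std::vector<float> of weights (standing in for the dist2[i].first values) and the max_tries limit are all hypothetical names introduced for illustration.

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// Hypothetical helper: returns the first index at which the running sum of
// `weights` reaches `r`, or weights.size() if the sum never gets there.
// The sum is accumulated block by block, as in the nested-loop version.
int pick_index(const std::vector<float>& weights, float r, int block_size = 10000) {
    const int n = static_cast<int>(weights.size());
    float sum_cdf = 0.0f;
    for (int start = 0; start < n; start += block_size) {
        const int block_end = std::min(start + block_size, n);
        float block_sum = 0.0f;
        for (int i = start; i < block_end; ++i) {
            block_sum += weights[i];
            if (sum_cdf + block_sum >= r) {
                return i;  // element i pushed the running sum up to r
            }
        }
        sum_cdf += block_sum;
    }
    return n;  // r was never reached: the caller must draw again
}

// Hypothetical wrapper: draws a new r until pick_index returns a valid index.
// max_tries is an assumed safety limit; -1 signals that every draw failed.
int sample_index(const std::vector<float>& weights, float sum_distribution,
                 int max_tries = 100) {
    const int n = static_cast<int>(weights.size());
    for (int t = 0; t < max_tries; ++t) {
        const float r = sum_distribution * (float)std::rand() / (float)RAND_MAX;
        const int idx = pick_index(weights, r);
        if (idx < n) {
            return idx;
        }
    }
    return -1;
}
```

If something like this is dropped into the question's do-while loop, note that the unique flag there would also need to be reset to true at the start of each pass; as written, a single duplicate draw leaves it false and the loop keeps spinning.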