I want to compute the maximum absolute difference between two 3D arrays with C++ AMP.
With OpenMP this is easy. Consider two arrays:
float*** u, ***uOld;
The code is:
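#include <algorithm> // std::max
#include <cmath>     // std::fabs
double residual = 0;
// reduction(max : ...) requires OpenMP 3.1 or later
#pragma omp parallel for schedule(static) collapse(3) reduction(max : residual)
for (int i = 0; i < nbX; i++)
    for (int j = 0; j < nbY; j++)
        for (int k = 0; k < nbTheta; k++)
            residual = std::max(residual, static_cast<double>(std::fabs(u[i][j][k] - uOld[i][j][k])));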
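A GPU version is easy to get subtly wrong, so it helps to have a plain CPU reference to compare against. The sketch below is only an illustration: it assumes the data are stored contiguously in row-major order (one block of nbX*nbY*nbTheta floats) instead of as float***, and the names maxAbsDiff and flatIndex are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Sequential reference for max |u - uOld| over two arrays of equal size.
// Assumption: both arrays are stored contiguously (row-major).
float maxAbsDiff(const std::vector<float>& u, const std::vector<float>& uOld)
{
    assert(u.size() == uOld.size());
    float residual = 0.0f; // |x| >= 0, so 0 is a safe starting value
    for (std::size_t n = 0; n < u.size(); ++n)
        residual = std::fmax(residual, std::fabs(u[n] - uOld[n]));
    return residual;
}

// Flat position of element [i][j][k] in row-major storage
// with dimensions (nbX, nbY, nbTheta).
std::size_t flatIndex(std::size_t i, std::size_t j, std::size_t k,
                      std::size_t nbY, std::size_t nbTheta)
{
    return (i * nbY + j) * nbTheta + k;
}
```

The result can be compared with the value that comes back from the GPU for the same input.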
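Using max_element from the C++ AMP Algorithms library would make this easy, but it is not implemented there. I thought of something like the following, but a reduction is still needed at the level of the outer loop: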
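extent<1> extTheta(nbTheta);
parallel_for_each(extTheta, [=, &u_, &uOld_](index<1> iTheta) restrict(amp)
{
    // Each thread computes the maximum over (iX, iY) for one iTheta
    type residual = 0; // `type` is the element type, e.g. float
    for (int iX = 0; iX < nbX; iX++)
        for (int iY = 0; iY < nbY; iY++)
            residual = fast_math::fmax(residual, fast_math::fabs(u_[iX][iY][iTheta] - uOld_[iX][iY][iTheta]));
    // Missing step: residual is local to this thread, so the nbTheta
    // per-thread results still have to be combined into one value
});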
The data are on the GPU and, for efficiency reasons, I do not want to transfer them off the GPU. How can this be done efficiently?
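The per-iTheta attempt in the question can be completed as a two-stage reduction: each thread stores its partial maximum in a separate buffer of nbTheta values, and those partial results are then reduced in a second, much smaller step (so only nbTheta values would need to leave the GPU if that second step runs on the host). The CPU sketch below only illustrates this data flow; it is not C++ AMP code, and the function names and the row-major layout are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Stage 1 (hypothetical helper): one partial maximum per iTheta, mirroring
// what each thread of the parallel_for_each over extTheta computes.
// Assumed layout: [iX][iY][iTheta] lives at (iX * nbY + iY) * nbTheta + iTheta.
std::vector<float> partialMaxPerTheta(const std::vector<float>& u,
                                      const std::vector<float>& uOld,
                                      std::size_t nbX, std::size_t nbY,
                                      std::size_t nbTheta)
{
    assert(u.size() == nbX * nbY * nbTheta && uOld.size() == u.size());
    std::vector<float> partial(nbTheta, 0.0f);
    for (std::size_t iTheta = 0; iTheta < nbTheta; ++iTheta) // one "thread" each
        for (std::size_t iX = 0; iX < nbX; ++iX)
            for (std::size_t iY = 0; iY < nbY; ++iY)
            {
                std::size_t n = (iX * nbY + iY) * nbTheta + iTheta;
                partial[iTheta] = std::fmax(partial[iTheta], std::fabs(u[n] - uOld[n]));
            }
    return partial;
}

// Stage 2 (hypothetical helper): reduce the nbTheta partials to one value.
float combinePartials(const std::vector<float>& partial)
{
    assert(!partial.empty());
    return *std::max_element(partial.begin(), partial.end());
}
```

The first stage has only nbTheta threads, so it exposes less parallelism than a reduction over all elements.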
Answer:
Here is a solution inspired by this MSDN blog post: https://blogs.msdn.microsoft.com/nativeconcurrency/2012/03/08/parallel-reduction-using-c-amp
// Step 1: overwrite uOld_ with |u_ - uOld_| element by element
// (u_ and uOld_ are concurrency::array<float, 3>, hence captured by reference)
parallel_for_each(uOld_.extent, [=, &u_, &uOld_](index<3> idx) restrict(amp)
{
    uOld_[idx] = fast_math::fabs(u_[idx] - uOld_[idx]);
});
// Step 2: view the same storage as a 1D array and reduce it in place by halving
int n = nbTheta * nbX * nbY;
array_view<float, 1> residualReduce_ = uOld_.view_as<1>(extent<1>(n));
array_view<float, 1> residual_ = residualReduce_.section(index<1>(0), extent<1>(1));
while (n > 1)
{
    int half = (n + 1) / 2; // round up so an odd trailing element is kept
    parallel_for_each(extent<1>(n - half), [=](index<1> idx) restrict(amp)
    {
        // Thread i folds element i + half into element i; writes go to
        // [0, n - half) and reads come from [half, n), which never overlap
        residualReduce_[idx] = fast_math::fmax(residualReduce_[idx], residualReduce_[idx[0] + half]);
    });
    n = half;
}
// Step 3: copy the single result (element 0) to the host; residual is a float here
concurrency::copy(residual_, &residual);
// Step 4: update uOld_ to u_ for the next iteration
parallel_for_each(uOld_.extent, [=, &u_, &uOld_](index<3> idx) restrict(amp)
{
    uOld_[idx] = u_[idx];
});
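Unlike the snippet in the question, this solution also updates uOld to u.
The reduction is not the most efficient possible, but it is still fast compared with the rest of the code.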
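The index arithmetic of a halving reduction is easy to check on the CPU. In this sequential sketch (treeReduceMax is a hypothetical name), each pass of the outer loop plays the role of one parallel_for_each, and the length is rounded up so that an odd trailing element is carried into the next pass instead of being dropped.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Sequential simulation of a pairwise (tree) max-reduction.
// In each pass, "thread" i in [0, n - half) folds element i + half into
// element i. Writes go to [0, n - half) and reads come from [half, n);
// since n - half <= half, the ranges never overlap within a pass.
float treeReduceMax(std::vector<float> a) // by value: a is used as scratch space
{
    assert(!a.empty());
    for (std::size_t n = a.size(); n > 1;)
    {
        std::size_t half = (n + 1) / 2; // rounding up keeps an odd last element
        for (std::size_t i = 0; i < n - half; ++i)
            a[i] = std::fmax(a[i], a[i + half]);
        n = half;
    }
    return a[0];
}
```

For five elements the passes have lengths 5, 3, 2, 1, so the last element takes part in the comparison even though the length is odd.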