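I wrote a small test function, and it does not behave the way I want.
Basically, it should read an array and write its contents back (later, once it works, it should do more, but for now even that fails).
Debugging the GPU code, I see that the first few iterations (somehow executed in parallel... which probably makes sense for a GPU, but surprised me while debugging) work fine. But then, after 1-2 Debug-Continues (F5), some values that were previously set correctly get overwritten with 0. I really don't understand it. When I am back on the CPU, a lot of values are 0 even though they shouldn't be (basically, they should hold the original data, which is a simple test sequence).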
#include "stdafx.h"
#include <amp.h>

typedef unsigned char byte;
using namespace concurrency;

void AMPChangeBrightnessContrastWrapper2(byte* a, int len, float brightness, float contrast)
{
    array_view<unsigned int> dst(len/4, (unsigned int*)a);
    //dst.discard_data();
    parallel_for_each(dst.extent, [=](index<1> idx) restrict(amp)
    {
        // split into bytes (in floats)
        float temp1 = (dst[idx]) - (dst[idx] >> 8) * 256;
        // this completely fails! float temp1 = dst[idx] & 0xFF;
        float temp2 = (dst[idx] >> 8) - (dst[idx] >> 16) * 256;
        float temp3 = (dst[idx] >> 16) - (dst[idx] >> 24) * 256;
        float temp4 = (dst[idx] >> 24);
        // convert back to int-array
        dst[idx] = (int)(temp1 + temp2 * 256 + temp3 * 65536 + temp4 * 16777216);
    });
    //dst.synchronize();
}

int _tmain(int argc, _TCHAR* argv[])
{
    const int size = 30000;
    byte* a = new byte[size];
    // generate some unique test sequence.. first 99 numbers are just 0..98
    for (int i = 0; i < size; ++i)
        a[i] = (byte)((i + i / 99) % 256);
    AMPChangeBrightnessContrastWrapper2(a, size, -10.0f, 1.1f);
    for (int i = 0; i < 50; ++i)
        printf("%i, ", a[i]);
    char out[20];
    scanf_s("%s", out);
    return 0;
}
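So the simple (planned) steps are:
In case you wonder.. these are supposed to be color values.
The result is:
The output is (but it should just be increasing numbers starting at 0):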
0,1,2,3,0,5,6,7,0,9,10,11,16,13,14,15,0,17,18,19,32,21,22, 23,32,25,26,27,32,29,30,31,0,33,34,35,64,37,38,39,64,41,42, 43,44,45,46,47,64,49,
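Questions:
Answer 0 (score: 4)
• I guess I can't create an array_view of bytes, I have to use ints or floats?
You cannot create an array or array_view of byte. C++ AMP only supports a limited subset of C++ types. You can use a texture instead of an array_view. For image processing this has several advantages, in particular packing and unpacking is much faster because it is implemented by the GPU's hardware. See the full example below.
• Commenting out .synchronize at the end didn't change anything - how come?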
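You don't need dst.synchronize(), because the dst array_view goes out of scope, which causes the data to be implicitly synchronized back to CPU memory. By the way, you should not call dst.discard_data() at the start of the function, because doing so means the data from a will not be copied to the GPU.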
Here is an implementation using texture<>. Things to note:
The code...
void AMPChangeBrightnessContrastWrapper3(const byte* a, const int len,
    const float brightness, const float contrast)
{
    const int pixel_len = len / 4;
    graphics::texture<graphics::uint_4, 1> inputTx(pixel_len, a, len, 8u);
    graphics::texture<graphics::uint_4, 1> outputTx(pixel_len, 8u);
    graphics::writeonly_texture_view<graphics::uint_4, 1> outputTxVw(outputTx);

    parallel_for_each(outputTxVw.extent, [=, &inputTx, &outputTx](index<1> idx)
        restrict(amp)
    {
        const graphics::uint_4 v = inputTx[idx];

        float tmp = static_cast<float>(v.r);
        tmp = (tmp - 128) * contrast + brightness + 128;
        tmp = direct3d::clamp(tmp, 0.0f, 255.0f);
        const unsigned int temp1_ = static_cast<unsigned int>(tmp);

        tmp = static_cast<float>(v.g);
        tmp = (tmp - 128) * contrast + brightness + 128;
        tmp = direct3d::clamp(tmp, 0.0f, 255.0f);
        const unsigned int temp2_ = static_cast<unsigned int>(tmp);

        tmp = static_cast<float>(v.b);
        tmp = (tmp - 128) * contrast + brightness + 128;
        tmp = direct3d::clamp(tmp, 0.0f, 255.0f);
        const unsigned int temp3_ = static_cast<unsigned int>(tmp);

        tmp = static_cast<float>(v.a);
        tmp = (tmp - 128) * contrast + brightness + 128;
        tmp = direct3d::clamp(tmp, 0.0f, 255.0f);
        const unsigned int temp4_ = static_cast<unsigned int>(tmp);

        outputTxVw.set(idx, graphics::uint_4(temp1_, temp2_, temp3_, temp4_));
    });
    copy(outputTx, (void*)a, len);
}
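To check what the GPU kernel produces, it can help to have a plain CPU version of the same per-channel formula to compare against. The following is only a sketch: `adjust_channel` and `adjust_bytes_cpu` are hypothetical names that do not appear in the thread, and the code uses standard C++ only (no C++ AMP).

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the thread): applies the per-channel
// formula used in the kernel above to a single 8-bit value.
inline std::uint8_t adjust_channel(std::uint8_t value, float brightness, float contrast)
{
    float tmp = (static_cast<float>(value) - 128.0f) * contrast + brightness + 128.0f;
    tmp = std::min(std::max(tmp, 0.0f), 255.0f);  // clamp to the byte range
    return static_cast<std::uint8_t>(tmp);         // truncates, like the kernel's cast
}

// Hypothetical CPU reference: treats every byte of the buffer as one channel.
inline std::vector<std::uint8_t> adjust_bytes_cpu(const std::vector<std::uint8_t>& in,
                                                  float brightness, float contrast)
{
    std::vector<std::uint8_t> out(in.size());
    std::transform(in.begin(), in.end(), out.begin(),
        [=](std::uint8_t v) { return adjust_channel(v, brightness, contrast); });
    return out;
}
```

Running `adjust_bytes_cpu` on a copy of the input and comparing it byte by byte with the buffer written by the GPU version should reveal corrupted channels quickly. Note that the GPU version only touches the first `len / 4 * 4` bytes, while this sketch processes every byte.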
You can find more C++ AMP examples in the AMP Book.
Answer 1 (score: 0)
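Well.. so, after some more trial and error, answering my own question:
In case you run into something similar or need it, here is the solution (doing what was originally intended), which changes brightness and contrast in an array of luminance values: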
void AMPChangeBrightnessContrastWrapper(byte* a, int len, float brightness, float contrast)
{
    array_view<unsigned int> dst(len/4, (unsigned int*)a);
    parallel_for_each(dst.extent, [=](index<1> idx) restrict(amp)
    {
        float temp1 = dst[idx] & 0xFF;
        temp1 = (temp1 - 128) * contrast + brightness + 128;
        if (temp1 < 0)
            temp1 = 0;
        if (temp1 > 255)
            temp1 = 255;

        float temp2 = (dst[idx] >> 8) & 0xFF;
        temp2 = (temp2 - 128) * contrast + brightness + 128;
        if (temp2 < 0)
            temp2 = 0;
        if (temp2 > 255)
            temp2 = 255;

        float temp3 = (dst[idx] >> 16) & 0xFF;
        temp3 = (temp3 - 128) * contrast + brightness + 128;
        if (temp3 < 0)
            temp3 = 0;
        if (temp3 > 255)
            temp3 = 255;

        float temp4 = (dst[idx] >> 24);
        temp4 = (temp4 - 128) * contrast + brightness + 128;
        if (temp4 < 0)
            temp4 = 0;
        if (temp4 > 255)
            temp4 = 255;

        unsigned int temp1_ = (unsigned int)temp1;
        unsigned int temp2_ = (unsigned int)temp2;
        unsigned int temp3_ = (unsigned int)temp3;
        unsigned int temp4_ = (unsigned int)temp4;
        unsigned int res = temp1_ + (temp2_ << 8) + (temp3_ << 16) + (temp4_ << 24);
        dst[idx] = res;
    });
    dst.synchronize();
}
Also, even though I do (I think) quite a bit of calculation, this is about 2-4 times faster (release/debug) than using the CPU, on an Intel HD 4000.
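The thread does not spell out why the first attempt in the question corrupted every fourth byte. One likely cause, consistent with the output posted there, is that the first attempt recombined the four bytes in float arithmetic: a 32-bit float has a 24-bit significand, so a sum that needs more than 24 bits loses its lowest bits, which is exactly the first byte of each group. The working version recombines with integer shifts instead. The sketch below reproduces the effect on the CPU in standard C++; `pack_via_float` and `pack_via_int` are hypothetical names used only for illustration.

```cpp
#include <cstdint>

// Hypothetical helper: rebuilds the 32-bit value in float arithmetic,
// mirroring the expression used in the first attempt.
// Note: inputs whose sum rounds up to 2^32 are out of range for the final
// cast and are not handled in this sketch.
inline std::uint32_t pack_via_float(std::uint8_t b0, std::uint8_t b1,
                                    std::uint8_t b2, std::uint8_t b3)
{
    // volatile forces the sum to be stored as a real 32-bit float
    volatile float sum = static_cast<float>(b0)
                       + static_cast<float>(b1) * 256.0f
                       + static_cast<float>(b2) * 65536.0f
                       + static_cast<float>(b3) * 16777216.0f;
    return static_cast<std::uint32_t>(sum);
}

// Hypothetical helper: rebuilds the value with integer shifts,
// as the working version does. No bits are lost.
inline std::uint32_t pack_via_int(std::uint8_t b0, std::uint8_t b1,
                                  std::uint8_t b2, std::uint8_t b3)
{
    return static_cast<std::uint32_t>(b0)
         | (static_cast<std::uint32_t>(b1) << 8)
         | (static_cast<std::uint32_t>(b2) << 16)
         | (static_cast<std::uint32_t>(b3) << 24);
}
```

For the bytes 12, 13, 14, 15 the integer version returns 0x0F0E0D0C, while the float version rounds to 0x0F0E0D10, so the low byte becomes 16. That matches the 16 at position 12 of the output in the question.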