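OK, so the main idea of this task is to compute the average of multiple images. I have it working the normal way, so I thought I'd give CUDA a try, but unfortunately what I get in the output is the first image instead of the average image. (Inside the kernel I also tried setting some pixels to 0 to make sure something was happening, but no luck.)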
// My kernel:
// nImages - number of images in memory
// nBytes  - number of pixels * colors per image (also the size of dataOut)
// nImages * nBytes gives us the size of dataIn
// nBatch  - dataIn has about 1 million bytes per image and we run 6144 threads,
//           so we need 163 batches to compute the whole dataOut
__global__
void avg_arrays(unsigned char* cuDataIn, unsigned char* cuDataOut, int nImages, int nBytes, int nBatch)
{
    // get the position of the correct byte
    int j = threadIdx.x + nBatch;
    // if we're outside of the image then give up
    if(j >= nBytes) return;
    // proceed with averaging
    long lSum = 0;
    for(int i=0; i < nImages; ++i)
        lSum += cuDataIn[i*nBytes + j];
    lSum = lSum / nImages;
    cuDataOut[j] = lSum;
}
Memory allocation, etc.:
unsigned char* dataIn = 0;
unsigned char* dataOut= 0;
// Allocate and transfer memory to the device
gpuErrchk( cudaMalloc((void**)&dataIn, nPixelCountBGR * nNumberOfImages * sizeof(unsigned char))); //dataIn
gpuErrchk( cudaMalloc((void**)&dataOut, nPixelCountBGR * sizeof(unsigned char))); //dataOut
gpuErrchk( cudaMemcpy(dataIn, bmps, nPixelCountBGR * nNumberOfImages * sizeof(unsigned char), cudaMemcpyHostToDevice )); //dataIn
gpuErrchk( cudaMemcpy(dataOut, basePixels, nPixelCountBGR * sizeof(unsigned char), cudaMemcpyHostToDevice )); //dataOut
// Perform the array addition
dim3 dimBlock(N);
dim3 dimGrid(1);
// do it in batches, unless it's possible to run more threads at once; anyway, N is the max number of threads
for(int i=0; i<nPixelCountBGR; i+=N){
    cout << "Running with: nImg: "<< nNumberOfImages << ", nPixBGR " << nPixelCountBGR << ", and i = " << i << endl;
    avg_arrays<<<dimGrid, dimBlock>>>(dataIn, dataOut, nNumberOfImages, nPixelCountBGR, 0);
}
// Copy the contents back from the GPU
gpuErrchk(cudaMemcpy(basePixels, dataOut, nPixelCountBGR * sizeof(unsigned char), cudaMemcpyDeviceToHost));
gpuErrchk(cudaFree(dataOut));
gpuErrchk(cudaFree(dataIn));
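Error checking doesn't report anything, all the code runs smoothly, and at the end I just get an exact copy of the first image.
In case anyone needs it, here is the console output: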
Running with: nImg: 29, nPixBGR 1228800, and i = 0
...
Running with: nImg: 29, nPixBGR 1228800, and i = 1210368
Running with: nImg: 29, nPixBGR 1228800, and i = 1216512
Running with: nImg: 29, nPixBGR 1228800, and i = 1222656
Time of averaging: 0.219
Answer 0 (score: 1)
If N is greater than 512 or 1024 (depending on which GPU you are running on, which you didn't mention), then this is invalid:
dim3 dimBlock(N);
because you cannot launch a kernel with more than 512 or 1024 threads per block:
avg_arrays<<<dimGrid, dimBlock>>>(...
^
|
this is limited to 512 or 1024
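The per-block limit can be read from the device instead of guessed, and the batching loop can be replaced by one launch with many blocks. What follows is only a minimal sketch under stated assumptions: it uses device 0, a block size of 256, small synthetic test data, and a hypothetical kernel named avg_arrays_grid that builds its index from blockIdx.x. None of those come from the question.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical variant of the question's kernel (the name is an assumption).
// Each thread averages one byte position; the global index combines the
// block index and the thread index, so a single launch covers every byte.
__global__ void avg_arrays_grid(const unsigned char* in, unsigned char* out,
                                int nImages, int nBytes)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= nBytes) return;                 // guard the partial last block
    unsigned int sum = 0;
    for (int i = 0; i < nImages; ++i)
        sum += in[(size_t)i * nBytes + j];
    out[j] = (unsigned char)(sum / nImages);
}

int main()
{
    // Ask the runtime for the real limit on device 0 (assumed device).
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        fprintf(stderr, "no usable CUDA device\n");
        return 1;
    }
    printf("maxThreadsPerBlock = %d\n", prop.maxThreadsPerBlock);

    // Synthetic data: image i is filled with the value i * 10,
    // so the average of images 0..3 is (0 + 10 + 20 + 30) / 4 = 15.
    const int nImages = 4, nBytes = 10000;
    std::vector<unsigned char> hIn((size_t)nImages * nBytes), hOut(nBytes, 0);
    for (int i = 0; i < nImages; ++i)
        for (int j = 0; j < nBytes; ++j)
            hIn[(size_t)i * nBytes + j] = (unsigned char)(i * 10);

    unsigned char *dIn = nullptr, *dOut = nullptr;
    cudaMalloc((void**)&dIn, hIn.size());
    cudaMalloc((void**)&dOut, hOut.size());
    cudaMemcpy(dIn, hIn.data(), hIn.size(), cudaMemcpyHostToDevice);

    // 256 threads per block is an assumed value; it only has to stay
    // at or below prop.maxThreadsPerBlock. Round the block count up.
    const int threads = 256;
    const int blocks = (nBytes + threads - 1) / threads;
    avg_arrays_grid<<<blocks, threads>>>(dIn, dOut, nImages, nBytes);
    cudaError_t err = cudaGetLastError();    // catches an invalid launch
    if (err != cudaSuccess) {
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // The blocking copy also waits for the kernel to finish.
    err = cudaMemcpy(hOut.data(), dOut, hOut.size(), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "copy back failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    bool ok = true;
    for (int j = 0; j < nBytes; ++j)
        if (hOut[j] != 15) ok = false;
    printf(ok ? "all bytes averaged to 15\n" : "mismatch\n");

    cudaFree(dIn);
    cudaFree(dOut);
    return ok ? 0 : 1;
}
```

With this layout the host loop over batches disappears, because the grid dimension rather than the block dimension grows with the image size.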
If you learn proper cuda error checking and apply it to your kernel launch, you will catch this kind of error.
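A kernel launch returns no status of its own, so wrapping only the cudaMalloc and cudaMemcpy calls (as the question does) leaves a failed launch unreported. The sketch below shows one common way to check a launch, assuming a hypothetical macro CHECK_CUDA and a hypothetical kernel touch. The 6144 threads per block mirror the step size seen in the question's console output and are expected to be rejected at launch time.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (the name is an assumption): abort with a message
// when a runtime API call returns anything other than cudaSuccess.
#define CHECK_CUDA(call)                                              \
    do {                                                              \
        cudaError_t e_ = (call);                                      \
        if (e_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(e_));                          \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical kernel used only to have something to launch.
__global__ void touch(unsigned char* out)
{
    out[threadIdx.x] = 1;
}

int main()
{
    const int n = 6144;   // threads per block, above the 512/1024 limit
    unsigned char* d = nullptr;
    CHECK_CUDA(cudaMalloc((void**)&d, n));

    // The launch itself yields no return value to check.
    touch<<<1, n>>>(d);

    // Step 1: launch-time errors (bad grid or block size, too many
    // resources) are reported by cudaGetLastError right after the launch.
    cudaError_t launchErr = cudaGetLastError();
    printf("launch status:    %s\n", cudaGetErrorString(launchErr));

    // Step 2: errors raised while the kernel runs surface at the next
    // synchronizing call, so check that result as well.
    cudaError_t execErr = cudaDeviceSynchronize();
    printf("execution status: %s\n", cudaGetErrorString(execErr));

    CHECK_CUDA(cudaFree(d));

    // With an oversized block the launch status is expected to be an
    // error, which means the kernel never ran and the output buffer
    // kept whatever was copied into it beforehand.
    return launchErr == cudaSuccess ? 0 : 1;
}
```

That is consistent with the symptom in the question: dataOut was pre-filled from basePixels, no kernel ever executed, and the same bytes were copied back.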