I'm running into some problems handling large matrices. As explained in this other question, I have a program that works on big square matrices (say 5k-10k). The computational part is correct (though still not 100% optimized); I tested it with smaller square matrices (like 256-512). Here is my code:
#include <stdio.h>
#include <stdlib.h>

#define N 10000
#define RADIUS 100
#define SQRADIUS (RADIUS * RADIUS)  /* parenthesized so the macro expands safely */
#define THREADS 512

// many of these device functions are declared
__device__ unsigned char avg(const unsigned char *src, const unsigned int row, const unsigned int col) {
    unsigned int sum = 0, c = 0;
    // some work with radius and stuff
    return sum;
}
__global__ void applyAvg(const unsigned char *src, unsigned char *dest) {
    unsigned int tid = blockDim.x * blockIdx.x + threadIdx.x, tmp = 0;
    unsigned int stride = blockDim.x * gridDim.x;
    int col = tid % N, row = (int)(tid / N);
    while (tid < N * N) {
        if (row * col < N * N) {
            // choose which of the __device__ functions needs to be launched
        }
        tid += stride;
        col = tid % N, row = (int)(tid / N);
    }
    __syncthreads();
}
int main(void) {
    cudaError_t err;
    unsigned char *base, *thresh, *d_base, *d_thresh, *avg, *d_avg;
    int i, j;

    base = (unsigned char*)malloc((N * N) * sizeof(unsigned char));
    thresh = (unsigned char*)malloc((N * N) * sizeof(unsigned char));
    avg = (unsigned char*)malloc((N * N) * sizeof(unsigned char));

    err = cudaMalloc((void**)&d_base, (N * N) * sizeof(unsigned char));
    if (err != cudaSuccess) { printf("ERROR 1"); exit(-1); }
    err = cudaMalloc((void**)&d_thresh, (N * N) * sizeof(unsigned char));
    if (err != cudaSuccess) { printf("ERROR 2"); exit(-1); }
    err = cudaMalloc((void**)&d_avg, (N * N) * sizeof(unsigned char));
    if (err != cudaSuccess) { printf("ERROR 3"); exit(-1); }

    for (i = 0; i < N * N; i++) {
        base[i] = (unsigned char)(rand() % 256);
    }

    err = cudaMemcpy(d_base, base, (N * N) * sizeof(unsigned char), cudaMemcpyHostToDevice);
    if (err != cudaSuccess) { printf("ERROR 4"); exit(-1); }

    // more 'light' stuff to do before the 'heavy computation'

    applyAvg<<<(N + THREADS - 1) / THREADS, THREADS>>>(d_thresh, d_avg);

    err = cudaMemcpy(thresh, d_thresh, (N * N) * sizeof(unsigned char), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) { printf("ERROR 5"); exit(-1); }
    err = cudaMemcpy(avg, d_avg, (N * N) * sizeof(unsigned char), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) { printf("ERROR 6"); exit(-1); }

    getchar();
    return 0;
}
When I launch the program on a big matrix (say 10000 x 10000) with RADIUS 100 (the 'distance' ahead of each point of the matrix that I look at), it takes a lot of time.

I believe the problem lies both in the applyAvg<<<(N + THREADS - 1) / THREADS, THREADS>>> launch configuration (how I decide how many blocks and threads to run) and in the applyAvg(...) kernel itself (the stride and tid). Can someone clarify the best way to choose how many blocks/threads to launch, given that the matrix size can range from 5k to 10k?
Answer 0 (score: 1)
I suppose what you want to do is image filtering/convolution. Based on your current CUDA kernel, there are two things you can do to improve performance:

1. Use 2-D threads/blocks to avoid the / and % operators. They are slow.
2. Use shared memory to reduce global memory bandwidth.
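As a sketch of point 1 (this is illustrative only: the 2-D block shape is an assumption, and avg() stands for the asker's own device helper), a 2-D launch removes the need for / and %:

```cuda
// Sketch, assuming N from the question and a per-pixel device helper like avg().
#define BLOCK_W 16
#define BLOCK_H 16

__global__ void applyAvg2D(const unsigned char *src, unsigned char *dest) {
    // Each thread derives its (row, col) directly from the 2-D built-ins,
    // so no integer division or modulo is needed.
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < N && col < N) {
        dest[row * N + col] = avg(src, row, col);
    }
}

// Host-side launch, covering the whole N x N image:
// dim3 block(BLOCK_W, BLOCK_H);
// dim3 grid((N + BLOCK_W - 1) / BLOCK_W, (N + BLOCK_H - 1) / BLOCK_H);
// applyAvg2D<<<grid, block>>>(d_src, d_dest);
```

With this layout the grid-stride loop and the tid arithmetic from the question disappear entirely; each thread owns exactly one pixel.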
Here is a whitepaper on image convolution. It shows how to implement a high-performance box filter with CUDA:
http://docs.nvidia.com/cuda/samples/3_Imaging/convolutionSeparable/doc/convolutionSeparable.pdf
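To make point 2 concrete, here is a rough shared-memory tiling sketch. The tile size and the one-pixel halo are hypothetical; a real box filter with RADIUS 100 needs the separable, multi-pass scheme from the whitepaper above, since a 201-pixel halo will not fit in a tile:

```cuda
#define TILE 16
#define HALO 1  // hypothetical small halo; a 100-pixel radius needs the separable scheme

__global__ void avgShared(const unsigned char *src, unsigned char *dest) {
    __shared__ unsigned char tile[TILE + 2 * HALO][TILE + 2 * HALO];

    // Shift by HALO so edge threads load the apron around the tile.
    int col = blockIdx.x * TILE + threadIdx.x - HALO;
    int row = blockIdx.y * TILE + threadIdx.y - HALO;

    // Clamp loads at the image border, then stage the pixel in shared memory.
    int r = min(max(row, 0), N - 1);
    int c = min(max(col, 0), N - 1);
    tile[threadIdx.y][threadIdx.x] = src[r * N + c];
    __syncthreads();

    // Only interior threads (not in the halo) compute and write a result;
    // all their neighbor reads now hit shared memory instead of global memory.
    if (threadIdx.x >= HALO && threadIdx.x < TILE + HALO &&
        threadIdx.y >= HALO && threadIdx.y < TILE + HALO &&
        row < N && col < N) {
        unsigned int sum = 0;
        for (int dy = -HALO; dy <= HALO; dy++)
            for (int dx = -HALO; dx <= HALO; dx++)
                sum += tile[threadIdx.y + dy][threadIdx.x + dx];
        dest[row * N + col] = (unsigned char)(sum / ((2 * HALO + 1) * (2 * HALO + 1)));
    }
}
// Launch with block = dim3(TILE + 2 * HALO, TILE + 2 * HALO).
```

Each input pixel is fetched from global memory once per tile instead of once per neighboring output pixel, which is where the bandwidth saving comes from.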
The NVIDIA NPP library also provides a box-filter function, nppiFilterBox(), so you don't need to write your own kernel. Here are the documentation and a sample:
http://docs.nvidia.com/cuda/cuda-samples/index.html#box-filter-with-npp
NPP documentation, p. 1009: http://docs.nvidia.com/cuda/pdf/NPP_Library.pdf
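A hedged sketch of what calling the NPP box filter on device buffers like the asker's might look like; the exact nppiFilterBox_8u_C1R parameter list should be verified against the NPP documentation above, and the full-image ROI is a simplification (for border pixels the mask reads outside the image, so real code shrinks the ROI or pads, as the linked sample shows):

```cuda
#include <npp.h>

// Assumes d_src and d_dst are N x N single-channel (8-bit) device buffers,
// allocated as in the question. Sketch only; check the signature in the NPP doc.
void boxFilterNpp(const Npp8u *d_src, Npp8u *d_dst) {
    NppiSize roi    = { N, N };                           // region of interest
    NppiSize mask   = { 2 * RADIUS + 1, 2 * RADIUS + 1 }; // averaging window
    NppiPoint anchor = { RADIUS, RADIUS };                // center window on each pixel
    NppStatus st = nppiFilterBox_8u_C1R(d_src, N * sizeof(Npp8u),
                                        d_dst, N * sizeof(Npp8u),
                                        roi, mask, anchor);
    if (st != NPP_SUCCESS) { /* handle error */ }
}
```

The step arguments are the row pitches in bytes; since the buffers here are densely packed single-channel images, that is simply N.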