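I'm having some trouble understanding how to send a 2D array to CUDA. I have a program that parses a large file with 30 data points on each line. I read about 10 lines at a time and then create a matrix of lines and items (so in my example of 10 lines with 30 data points each, it would be int list[10][30];).
My goal is to send this array to my kernel and have each block process one row (I have this working perfectly in plain C, but CUDA is proving more challenging).
Here is what I have so far, with no luck (note: sizeOfBuckets = rows, sizeOfBucketsHoldings = items in a row... I know I should win an award for odd variable names):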
int list[sizeOfBuckets][sizeOfBucketsHoldings]; //this is created at the start of the file and I have confirmed it is filled with the correct data
#define sizeOfBuckets 10 //size of buckets before sending to process list
#define sizeOfBucketsHoldings 30
//Cuda part
//define device variables
int *dev_current_list[sizeOfBuckets][sizeOfBucketsHoldings];
//time to malloc the 2D array on device
size_t pitch;
cudaMallocPitch((int**)&dev_current_list, (size_t *)&pitch, sizeOfBucketsHoldings * sizeof(int), sizeOfBuckets);
//copy data from host to device
cudaMemcpy2D( dev_current_list, pitch, list, sizeOfBuckets * sizeof(int), sizeOfBuckets * sizeof(int), sizeOfBucketsHoldings * sizeof(int),cudaMemcpyHostToDevice );
process_list<<<count,1>>> (sizeOfBuckets, sizeOfBucketsHoldings, dev_current_list, pitch);
//free memory of device
cudaFree( dev_current_list );
__global__ void process_list(int sizeOfBuckets, int sizeOfBucketsHoldings, int *current_list, int pitch) {
    int tid = blockIdx.x;
    for (int r = 0; r < sizeOfBuckets; ++r) {
        int* row = (int*)((char*)current_list + r * pitch);
        for (int c = 0; c < sizeOfBucketsHoldings; ++c) {
            int element = row[c];
        }
    }
}
The error I get is:
main.cu(266): error: argument of type "int *(*)[30]" is incompatible with parameter of type "int *"
1 error detected in the compilation of "/tmp/tmpxft_00003f32_00000000-4_main.cpp1.ii".
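Line 266 is the kernel call process_list<<<count,1>>> (count, countListItem, dev_current_list, pitch);
I think the problem is that I am declaring the array parameter of my kernel as int *, but how else can I declare it? In my pure C code I use int current_list[num_of_rows][num_items_in_row], but I can't get the same thing to work in CUDA.
My end goal is simple: I just want each block to process one row (sizeOfBuckets) and loop through all the items in that row (sizeOfBucketsHoldings). I originally did a plain cudaMalloc and cudaMemcpy, but that didn't work, so I looked around and found cudaMallocPitch and cudaMemcpy2D (neither of which is in my CUDA by Example book). I have been trying to study examples, but they seem to give me the same error (I am currently reading the CUDA C Programming Guide and found this idea on page 22, but still no luck). Any ideas? Or suggestions on where to look?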
Edit: To test this, I just want to add up the values in each row (I copied the logic from the array addition example in CUDA by Example). My kernel:
__global__ void process_list(int sizeOfBuckets, int sizeOfBucketsHoldings, int *current_list, size_t pitch, int *total) {
    //TODO: we need to flip the list as well
    int tid = blockIdx.x;
    for (int c = 0; c < sizeOfBucketsHoldings; ++c) {
        total[tid] = total + current_list[tid][c];
    }
}
Here is how I declare total in main:
int *dev_total;
cudaMalloc( (void**)&dev_total, sizeOfBuckets * sizeof(int) );
Answer 0 (score: 3)
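There are a few errors in your code: dev_current_list is declared as a 2D array of pointers, but cudaMallocPitch needs a single int* that it fills in; the source pitch and the width passed to cudaMemcpy2D must both be the row width in bytes (sizeOfBucketsHoldings * sizeof(int)) and the height must be the number of rows (sizeOfBuckets); and the kernel cannot index a plain int* as current_list[tid][c].
This example may help you with the memory allocation: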
#include <cstdio>

// One block per row: block tid sums the items of row tid into total[tid].
__global__ void process_list(int sizeOfBucketsHoldings, int* total, int* current_list, size_t pitch)
{
    int tid = blockIdx.x;
    total[tid] = 0;
    for (int c = 0; c < sizeOfBucketsHoldings; ++c)
    {
        // pitch is in bytes, so step to the row through a char* before indexing ints
        total[tid] += *((int*)((char*)current_list + tid * pitch) + c);
    }
}

int main()
{
    const int sizeOfBuckets = 10;
    const int sizeOfBucketsHoldings = 30;
    size_t width = sizeOfBucketsHoldings * sizeof(int); // row width, needs to be in bytes
    size_t height = sizeOfBuckets;                      // number of rows

    // host data as a one-dimensional array; row i is filled with the value i
    int* list = new int[sizeOfBuckets * sizeOfBucketsHoldings];
    for (int i = 0; i < sizeOfBuckets; i++)
        for (int j = 0; j < sizeOfBucketsHoldings; j++)
            list[i * sizeOfBucketsHoldings + j] = i;

    size_t pitch_h = sizeOfBucketsHoldings * sizeof(int); // host pitch, always in bytes

    int* dev_current_list;
    size_t pitch_d; // device pitch in bytes, chosen by cudaMallocPitch
    cudaMallocPitch((void**)&dev_current_list, &pitch_d, width, height);

    int* test;
    cudaMalloc((void**)&test, sizeOfBuckets * sizeof(int));
    int* h_test = new int[sizeOfBuckets];

    cudaMemcpy2D(dev_current_list, pitch_d, list, pitch_h, width, height, cudaMemcpyHostToDevice);

    process_list<<<sizeOfBuckets, 1>>>(sizeOfBucketsHoldings, test, dev_current_list, pitch_d);
    cudaDeviceSynchronize();

    cudaMemcpy(h_test, test, sizeOfBuckets * sizeof(int), cudaMemcpyDeviceToHost);

    // row i holds 30 copies of i, so the sum printed for row i should be 30 * i
    for (int i = 0; i < sizeOfBuckets; i++)
        printf("%d %d\n", i, h_test[i]);

    cudaFree(dev_current_list);
    cudaFree(test);
    delete[] list;
    delete[] h_test;
    return 0;
}
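To make the meaning of the cudaMemcpy2D arguments more concrete, here is a minimal host-only C++ sketch of a row-by-row pitched copy. It is only an illustration under assumptions: copy2d and pack_pitched are hypothetical helpers written for this answer, not CUDA functions, and the real cudaMemcpy2D may well be implemented differently. The point is that both pitches and the width are byte counts, while the height is a row count.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical host-only stand-in for a pitched 2D copy (not a CUDA API).
// dpitch and spitch are row strides in bytes, width is the number of bytes
// copied per row, and height is the number of rows.
void copy2d(void* dst, std::size_t dpitch, const void* src, std::size_t spitch,
            std::size_t width, std::size_t height)
{
    assert(dpitch >= width && spitch >= width);
    char* d = static_cast<char*>(dst);
    const char* s = static_cast<const char*>(src);
    for (std::size_t r = 0; r < height; ++r)
        std::memcpy(d + r * dpitch, s + r * spitch, width);
}

// Copies a tightly packed rows x cols int matrix into a zero-filled buffer
// whose rows are padded to `pitch` bytes each.
std::vector<char> pack_pitched(const std::vector<int>& src, std::size_t rows,
                               std::size_t cols, std::size_t pitch)
{
    std::vector<char> dst(rows * pitch, 0);
    copy2d(dst.data(), pitch, src.data(), cols * sizeof(int),
           cols * sizeof(int), rows);
    return dst;
}
```

With 10 rows of 30 ints, the source pitch and the width are both 30 * sizeof(int) bytes and the height is 10, which mirrors the pitch_h, width and height values used in the example above.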
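To access the 2D array in the kernel, use the pattern base_addr + y * pitch_d + x, where y * pitch_d is a byte offset and x is an element index: (int*)((char*)base_addr + y * pitch_d) + x.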
Warning: the pitch is always in bytes. You need to cast the pointer to a byte pointer (char*) before adding the row offset.
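The byte-based row arithmetic can be tried on the host without a GPU. The following is a small C++ sketch, assuming a buffer whose rows are padded to pitch_bytes; row_ptr and sum_row are hypothetical names introduced here for illustration and are not part of CUDA.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: returns a pointer to row y of a pitched int buffer.
// The row offset is computed in bytes, so the base pointer is viewed as
// char* first and only then converted back to int*.
inline int* row_ptr(int* base, std::size_t pitch_bytes, std::size_t y)
{
    return reinterpret_cast<int*>(reinterpret_cast<char*>(base) + y * pitch_bytes);
}

// Sums the first `cols` items of row y, which is what each block does in the kernel.
int sum_row(int* base, std::size_t pitch_bytes, std::size_t y, std::size_t cols)
{
    assert(pitch_bytes >= cols * sizeof(int));   // a row must fit inside the pitch
    assert(pitch_bytes % sizeof(int) == 0);      // keep every row int-aligned
    int* row = row_ptr(base, pitch_bytes, y);
    int total = 0;
    for (std::size_t c = 0; c < cols; ++c)
        total += row[c];
    return total;
}
```

Indexing the int pointer directly with base[y * pitch_bytes + x] would scale the byte pitch by sizeof(int) a second time and land in the wrong row, which is why the cast to char* comes first.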