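I am new to CUDA C and I am trying to parallelize the following piece of the slave_sort function, which, as you will notice, is already parallelized with POSIX threads. I have the following structures: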
typedef struct prefix_node {
    long densities[MAX_RADIX];
    long ranks[MAX_RADIX];
    char pad[PAGE_SIZE];
} prefix_node;

struct global_memory {
    long Index; /* process ID */
    struct prefix_node prefix_tree[2 * MAX_PROCESSORS];
} *global;
void slave_sort(){
    .
    .
    .
    long *rank_me_mynum;
    struct prefix_node* n;
    struct prefix_node* r;
    struct prefix_node* l;
    .
    .
    MyNum = global->Index;
    global->Index++;
    n = &(global->prefix_tree[MyNum]);
    for (i = 0; i < radix; i++) {
        n->densities[i] = key_density[i];
        n->ranks[i] = rank_me_mynum[i];
    }
    offset = MyNum;
    level = number_of_processors >> 1;
    base = number_of_processors;
    while ((offset & 0x1) != 0) {
        offset >>= 1;
        r = n;
        l = n - 1;
        index = base + offset;
        n = &(global->prefix_tree[index]);
        if (offset != (level - 1)) {
            for (i = 0; i < radix; i++) {
                n->densities[i] = r->densities[i] + l->densities[i];
                n->ranks[i] = r->ranks[i] + l->ranks[i];
            }
        } else {
            for (i = 0; i < radix; i++) {
                n->densities[i] = r->densities[i] + l->densities[i];
            }
        }
        base += level;
        level >>= 1;
    }
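MyNum is the ID of the processor. Once the code is moved into a kernel, I want MyNum to be represented by blockIdx.x. The problem is that I am confused by the structures: I do not know how to pass them to the kernel. Can anyone help me?
Is the following code correct?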
__global__ void testkernel(prefix_node *prefix_tree, long *dev_rank_me_mynum, long *key_density,long radix)
    int i = threadIdx.x + blockIdx.x*blockDimx.x;
    prefix_node *n;
    prefix_node *l;
    prefix_node *r;
    long offset;
    .
    .
    .
    n = &prefix_tree[blockIdx.x];
    if((i%numthreads) == 0){
        for(int j=0; j<radix; j++){
            n->densities[j] = key_density[j + radix*blockIdx.x];
            n->ranks[i] = dev_rank_me_mynum[j + radix*blockIdx.x];
        }
        .
        .
        .
    }

int main(...){
    long *dev_rank_me_mynum;
    long *key_density;
    prefix_node *prefix_tree;
    long radix = 1024;
    cudaMalloc((void**)&dev_rank_me_mynum, radix*numblocks*sizeof(long));
    cudaMalloc((void**)&key_density, radix*numblocks*sizeof(long));
    cudaMalloc((void**)&prefix_tree, numblocks*sizeof(prefix_node));
    testkernel<<<numblocks,numthreads>>>(prefix_tree,dev_runk_me_mynum,key_density,radix);
}
Answer 0 (score: 0)
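The host API code you posted in your edit looks fine. The prefix_node structure contains only statically declared arrays, so a single cudaMalloc call is all you need to allocate the memory for the kernel. The way you pass prefix_tree to the kernel is also fine.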
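To illustrate why one allocation is enough, here is a minimal sketch. Because prefix_node holds only fixed-size arrays and no pointers, an array of nodes is a single contiguous block of bytes that cudaMalloc and cudaMemcpy can handle directly. The values of MAX_RADIX and PAGE_SIZE, the kernel stamp_nodes and the helper roundtrip_node_density are assumptions made up for this illustration; none of them come from your code.

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

#define MAX_RADIX 1024   /* assumed value; not given in the question */
#define PAGE_SIZE 4096   /* assumed value; not given in the question */

typedef struct prefix_node {
    long densities[MAX_RADIX];
    long ranks[MAX_RADIX];
    char pad[PAGE_SIZE];
} prefix_node;

/* Hypothetical kernel: every block writes recognizable values into its own node. */
__global__ void stamp_nodes(prefix_node *prefix_tree, long radix)
{
    prefix_node *n = &prefix_tree[blockIdx.x];
    for (long j = threadIdx.x; j < radix; j += blockDim.x) {
        n->densities[j] = (long)blockIdx.x;
        n->ranks[j] = j;
    }
}

/* Hypothetical helper: allocates numblocks nodes with ONE cudaMalloc, runs the
   kernel, copies the whole array back with ONE cudaMemcpy, and returns
   densities[0] of node `which` (or -1 on bad arguments or any CUDA error). */
long roundtrip_node_density(int numblocks, int numthreads, int which)
{
    if (numblocks <= 0 || numthreads <= 0 || which < 0 || which >= numblocks)
        return -1;

    size_t bytes = (size_t)numblocks * sizeof(prefix_node);
    prefix_node *host_tree = (prefix_node *)malloc(bytes);
    prefix_node *dev_tree = NULL;
    long result = -1;

    if (host_tree == NULL)
        return -1;
    if (cudaMalloc((void **)&dev_tree, bytes) != cudaSuccess) {
        free(host_tree);
        return -1;
    }

    stamp_nodes<<<numblocks, numthreads>>>(dev_tree, MAX_RADIX);

    /* cudaMemcpy waits for the kernel and also reports any launch error. */
    if (cudaMemcpy(host_tree, dev_tree, bytes, cudaMemcpyDeviceToHost) == cudaSuccess)
        result = host_tree[which].densities[0];

    cudaFree(dev_tree);
    free(host_tree);
    return result;
}
```

Calling roundtrip_node_density(4, 128, 3) should give back the block index that was written into node 3. If the structure contained pointers to separately allocated arrays instead, each of those would need its own device allocation and the pointers would have to be patched on the device side.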
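The kernel code, which is incomplete and contains a number of obvious typos, is another matter. It appears your intention is to have only a single thread per block operate on one "node" of prefix_tree. That will be terribly inefficient and will use only a small fraction of the GPU's total capacity. For example, why do this: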
prefix_node *n = &prefix_tree[blockIdx.x];
if((i%numthreads) == 0){
    for(int j=0; j<radix; j++){
        n->densities[j] = key_density[j + radix*blockIdx.x];
        n->ranks[j] = dev_rank_me_mynum[j + radix*blockIdx.x];
    }
    .
    .
    .
}
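when you could do this:
prefix_node *n = &prefix_tree[blockIdx.x];
for(int j=threadIdx.x; j<radix; j+=blockDim.x){
    n->densities[j] = key_density[j + radix*blockIdx.x];
    n->ranks[j] = dev_rank_me_mynum[j + radix*blockIdx.x];
}
This coalesces the memory reads and uses as many threads in the block as you choose to launch, rather than just one, so it should be many times faster. So perhaps you should rethink your strategy of trying to translate the serial C code you posted directly into a kernel.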
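For reference, a complete version of that block-stride loop with its host-side launch might look like the sketch below. It only covers the initial copy into each node, not the tree combination in your while loop. The values of MAX_RADIX and PAGE_SIZE, the names fill_nodes and run_fill_nodes, and the test data are assumptions made for illustration.

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

#define MAX_RADIX 1024   /* assumed value; not given in the question */
#define PAGE_SIZE 4096   /* assumed value; not given in the question */

typedef struct prefix_node {
    long densities[MAX_RADIX];
    long ranks[MAX_RADIX];
    char pad[PAGE_SIZE];
} prefix_node;

/* One block per node. Thread t handles elements t, t + blockDim.x, ... so
   neighbouring threads touch neighbouring addresses (coalesced access). */
__global__ void fill_nodes(prefix_node *prefix_tree, const long *rank_me_mynum,
                           const long *key_density, long radix)
{
    prefix_node *n = &prefix_tree[blockIdx.x];
    const long base = radix * blockIdx.x;
    for (long j = threadIdx.x; j < radix; j += blockDim.x) {
        n->densities[j] = key_density[base + j];
        n->ranks[j] = rank_me_mynum[base + j];
    }
}

/* Hypothetical driver: builds test inputs (density = k, rank = 2k for flat
   index k), launches the kernel, and returns 1 if every node holds its slice
   of the inputs, 0 on a mismatch, bad arguments, or any CUDA error. */
int run_fill_nodes(int numblocks, int numthreads, long radix)
{
    if (numblocks <= 0 || numthreads <= 0 || radix <= 0 || radix > MAX_RADIX)
        return 0;

    size_t count = (size_t)numblocks * (size_t)radix;
    size_t in_bytes = count * sizeof(long);
    size_t tree_bytes = (size_t)numblocks * sizeof(prefix_node);
    long *h_density = (long *)malloc(in_bytes);
    long *h_rank = (long *)malloc(in_bytes);
    prefix_node *h_tree = (prefix_node *)malloc(tree_bytes);
    long *d_density = NULL;
    long *d_rank = NULL;
    prefix_node *d_tree = NULL;
    int ok = (h_density != NULL && h_rank != NULL && h_tree != NULL);

    for (size_t k = 0; ok && k < count; k++) {
        h_density[k] = (long)k;
        h_rank[k] = 2 * (long)k;
    }

    ok = ok && cudaMalloc((void **)&d_density, in_bytes) == cudaSuccess;
    ok = ok && cudaMalloc((void **)&d_rank, in_bytes) == cudaSuccess;
    ok = ok && cudaMalloc((void **)&d_tree, tree_bytes) == cudaSuccess;
    ok = ok && cudaMemcpy(d_density, h_density, in_bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    ok = ok && cudaMemcpy(d_rank, h_rank, in_bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok)
        fill_nodes<<<numblocks, numthreads>>>(d_tree, d_rank, d_density, radix);

    /* The blocking copy back also surfaces any error from the launch. */
    ok = ok && cudaMemcpy(h_tree, d_tree, tree_bytes, cudaMemcpyDeviceToHost) == cudaSuccess;

    for (int b = 0; ok && b < numblocks; b++) {
        for (long j = 0; j < radix; j++) {
            size_t k = (size_t)b * (size_t)radix + (size_t)j;
            if (h_tree[b].densities[j] != h_density[k] || h_tree[b].ranks[j] != h_rank[k])
                ok = 0;
        }
    }

    cudaFree(d_density);
    cudaFree(d_rank);
    cudaFree(d_tree);
    free(h_density);
    free(h_rank);
    free(h_tree);
    return ok;
}
```

A call such as run_fill_nodes(8, 128, 1024) exercises the loop with eight nodes and 128 threads per block. The loop does not require radix to be a multiple of the block size, since each thread simply stops once its index passes radix.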