I am a beginner CUDA programmer. I am trying to build an application similar to the NVIDIA particles sample (many balls inside a cube). I have a kernel launcher function as follows:
void Ccuda::sort_Particles_And_Find_Cell_Start (int *Cell_Start,     // output
                                                int *Cell_End,       // output
                                                float3 *Sorted_Pos,  // output
                                                float3 *Sorted_Vel,  // output
                                                int *Particle_Cell,  // input
                                                int *Particle_Index, // input
                                                float3 *Old_Pos,
                                                float3 *Old_Vel,
                                                int Num_Particles,
                                                int Num_Cells)
{
    int numThreads, numBlocks;
    /*Cell_Start = (int*) cudaAlloc (Num_Cells, sizeof(int));
    Cell_End = (int*) cudaAlloc (Num_Cells, sizeof(int));
    Sorted_Pos = (float3*) cudaAlloc (Num_Particles, sizeof(int));
    Sorted_Vel = (float3*) cudaAlloc (Num_Particles, sizeof(int));*/

    // debug: copy Particle_Cell back to the host to inspect it before the launch
    int *h_p_cell = (int *) malloc (Num_Particles * sizeof (int));
    cudaMemcpy (h_p_cell, Particle_Cell, Num_Particles * sizeof(int), cudaMemcpyDeviceToHost);
    free (h_p_cell);

    computeGridSize(Num_Particles, 512, numBlocks, numThreads);
    sort_Particles_And_Find_Cell_StartD<<<numBlocks, numThreads>>>(Cell_Start, Cell_End, Sorted_Pos, Sorted_Vel, Particle_Cell, Particle_Index, Old_Pos, Old_Vel, Num_Particles);

    // debug: copy Particle_Cell back again to inspect it after the launch
    h_p_cell = (int *) malloc (Num_Particles * sizeof (int));
    cudaMemcpy (h_p_cell, Particle_Cell, Num_Particles * sizeof(int), cudaMemcpyDeviceToHost);
    free (h_p_cell);
}
And this is the global kernel function:
__global__ void sort_Particles_And_Find_Cell_StartD (int *Cell_Start,     // output
                                                     int *Cell_End,       // output
                                                     float3 *Sorted_Pos,  // output
                                                     float3 *Sorted_Vel,  // output
                                                     int *Particle_Cell,  // input
                                                     int *Particle_Index, // input
                                                     float3 *Old_Pos,
                                                     float3 *Old_Vel,
                                                     int Num_Particles)
{
    int hash;
    extern __shared__ int Shared_Hash[]; // blockSize + 1 elements
    int index = blockIdx.x * blockDim.x + threadIdx.x;

    if (index < Num_Particles)
    {
        hash = Particle_Cell[index];
        Shared_Hash[threadIdx.x + 1] = hash;
        if (index > 0 && threadIdx.x == 0)
        {
            // first thread in block loads the previous particle's hash
            Shared_Hash[0] = Particle_Cell[index - 1];
        }
    }

    __syncthreads();

    if (index < Num_Particles)
    {
        // If this particle has a different cell index to the previous
        // particle, then it must be the first particle in the cell,
        // so store the index of this particle in Cell_Start. The same
        // boundary also marks the end of the previous particle's cell.
        if (index == 0 || hash != Shared_Hash[threadIdx.x]) // first thread in the grid, or a different cell than the previous thread's particle
        {
            Cell_Start[hash] = index;
            if (index > 0)
                Cell_End[Shared_Hash[threadIdx.x]] = index;
        }
        if (index == Num_Particles - 1)
        {
            Cell_End[hash] = index + 1;
        }

        // Now use the sorted index to reorder the pos and vel data
        int Sorted_Index = Particle_Index[index];
        //float3 pos = FETCH(Old_Pos, Sorted_Index); // macro does either global read or texture fetch
        //float3 vel = FETCH(Old_Vel, Sorted_Index); // see particles_kernel.cuh
        float3 pos = Old_Pos[Sorted_Index];
        float3 vel = Old_Vel[Sorted_Index];
        Sorted_Pos[index] = pos;
        Sorted_Vel[index] = vel;
    }
}
During execution I get the debug error message R6010 ("abort() has been called"). As you can see in the launcher function (the first one), I use int *h_p_cell to inspect the contents of Particle_Cell before and after the kernel execution, and the contents appear to have changed, even though nothing in the kernel assigns to Particle_Cell. The Particle_Cell memory is allocated during the program's init() and filled via cudaMemcpy. I have been trying to solve this for several days without success. Can anyone help?
Answer 0 (score: 1)
Your kernel expects a dynamic allocation of shared memory:
extern __shared__ int Shared_Hash[]; // blockSize + 1 elements
But you are not allocating any in your kernel invocation:
sort_Particles_And_Find_Cell_StartD<<<numBlocks, numThreads>>>(Cell_Start,Cell_End, Sorted_Pos, Sorted_Vel, Particle_Cell, Particle_Index, Old_Pos, Old_Vel, Num_Particles);
                                                           ^
                                                           |
                                       missing shared memory size parameter
You should provide the amount of shared memory in the launch configuration. Since the kernel writes Shared_Hash[threadIdx.x + 1], it needs room for blockSize + 1 ints, i.e. (numThreads + 1) * sizeof(int) bytes. You probably want something like this:
sort_Particles_And_Find_Cell_StartD<<<numBlocks, numThreads, ((numThreads+1)*sizeof(int))>>>(Cell_Start,Cell_End, Sorted_Pos, Sorted_Vel, Particle_Cell, Particle_Index, Old_Pos, Old_Vel, Num_Particles);
This error will cause the kernel to abort as soon as it attempts to access shared memory. You should also do proper cuda error checking on all CUDA API calls and kernel launches; I see no evidence of that in your code.
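A minimal sketch of such checking (the CUDA_CHECK macro name is my own invention; any of the usual variants works the same way):

#include <cstdio>
#include <cstdlib>

// Hypothetical helper: abort with file/line context when a CUDA call fails
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err = (call);                                   \
        if (err != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error %s at %s:%d\n",             \
                    cudaGetErrorString(err), __FILE__, __LINE__);   \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

Wrap every API call with it, and after each kernel launch check both the launch itself and its execution:

CUDA_CHECK(cudaMemcpy(h_p_cell, Particle_Cell, Num_Particles * sizeof(int), cudaMemcpyDeviceToHost));
sort_Particles_And_Find_Cell_StartD<<<numBlocks, numThreads, (numThreads + 1) * sizeof(int)>>>(Cell_Start, Cell_End, Sorted_Pos, Sorted_Vel, Particle_Cell, Particle_Index, Old_Pos, Old_Vel, Num_Particles);
CUDA_CHECK(cudaGetLastError());      // catches launch-configuration errors
CUDA_CHECK(cudaDeviceSynchronize()); // catches errors raised during kernel execution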
Once you have all the API errors sorted out, run your code with cuda-memcheck. The unexpected writes to Particle_Cell may be due to out-of-bounds accesses from your kernel, which cuda-memcheck will expose.
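As a hypothetical illustration of how that can happen (this repro is mine, not taken from the posted code): if any value in Particle_Cell is negative or >= Num_Cells, the kernel's write to Cell_Start[hash] lands outside that allocation and can silently corrupt a neighboring buffer such as Particle_Cell. cuda-memcheck reports this as an invalid global write, and a device-side assert makes the bad hash visible immediately:

#include <assert.h>
#include <cstdio>

__global__ void writeCellStartD(int *Cell_Start, const int *Particle_Cell,
                                int Num_Particles, int Num_Cells)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < Num_Particles)
    {
        int hash = Particle_Cell[index];
        assert(hash >= 0 && hash < Num_Cells); // device-side assert (needs sm_20 or later)
        Cell_Start[hash] = index;              // out-of-bounds write if hash is bad
    }
}

int main()
{
    const int Num_Particles = 4, Num_Cells = 4;
    int h_cell[Num_Particles] = {0, 1, 2, 7};  // 7 is deliberately out of range
    int *d_cell = 0, *d_start = 0;
    cudaMalloc(&d_cell, Num_Particles * sizeof(int));
    cudaMalloc(&d_start, Num_Cells * sizeof(int));
    cudaMemcpy(d_cell, h_cell, Num_Particles * sizeof(int), cudaMemcpyHostToDevice);
    writeCellStartD<<<1, Num_Particles>>>(d_start, d_cell, Num_Particles, Num_Cells);
    // with the assert compiled in, this reports cudaErrorAssert for the bad hash
    printf("kernel status: %s\n", cudaGetErrorString(cudaDeviceSynchronize()));
    cudaFree(d_cell);
    cudaFree(d_start);
    return 0;
}

Compiled as a .cu file and run under cuda-memcheck, the tool pinpoints the offending thread and address.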