Nearest Neighbors in CUDA Particles

Date: 2016-07-28 20:27:02

Tags: algorithm opencl simulation physics nearest-neighbor

Edit 2: Please take a look at this crosspost for a TL;DR.

Edit: Given that the particles are segmented into grid cells (say a 16^3 grid), would it be a better idea to run one work-group for each grid cell, with as many work-items per work-group as the maximum possible number of particles per grid cell?

In that case I could load all particles from the neighboring cells into local memory and iterate through them, computing some properties. Then I could write a specific value into each particle in the current grid cell. A rough sketch of what I have in mind follows below.

Would this approach be better than running the kernel for all particles, with each particle iterating over (most of the time the same) neighbors?
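Roughly, the kernel I have in mind would look something like this (still pseudocode: MAX_CELL_PARTICLES, cell_of_group(), cell_to_hash() and heavy_computation() are placeholders, boundary cells are not handled, and empty cells are assumed to store a range with y < x):

__kernel void process_cell(__global const float3* P,
                           __global const int2*   Sp,
                           __global const int2*   L,
                           __global int*          Out) {
  // Cache for the particles of the 27 neighboring cells (placeholder bound).
  __local float3 neigh_cache[27 * MAX_CELL_PARTICLES];

  int3 cell = cell_of_group(get_group_id(0)); // grid cell owned by this work-group
  int  lid  = get_local_id(0);
  int  lsz  = get_local_size(0);

  // 1) Cooperative load: work-items copy strided subsets of the neighbor
  //    particles into local memory.
  int cached = 0;
  for (int x = -1; x <= 1; x++)
    for (int y = -1; y <= 1; y++)
      for (int z = -1; z <= 1; z++) {
        int2 range = L[cell_to_hash(cell + (int3)(x, y, z))]; // inclusive range
        for (int p = range.x + lid; p <= range.y; p += lsz)
          neigh_cache[cached + (p - range.x)] = P[Sp[p].y];
        cached += range.y - range.x + 1; // same value on every work-item
      }
  barrier(CLK_LOCAL_MEM_FENCE);

  // 2) Each work-item processes one particle of the home cell (if any).
  int2 home = L[cell_to_hash(cell)];
  int  me   = home.x + lid;
  if (me <= home.y) {
    int processed_value = 0;
    for (int n = 0; n < cached; n++)
      processed_value += heavy_computation(neigh_cache[n]);
    Out[Sp[me].y] = processed_value;
  }
}

The cooperative load would hopefully coalesce the global reads, and each neighbor position would be fetched from global memory only once per work-group instead of once per particle.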

Also, what is the ideal ratio of the number of particles to the number of grid cells?


I'm trying to reimplement (and modify) CUDA Particles for OpenCL and use it to query nearest neighbors for every particle. I've created the following structures:

  • Buffer P holding all particles' 3D positions (float3)
  • Buffer Sp storing int2 pairs of particle ids and their spatial hashes. Sp is sorted according to the hash. (The hash is just a simple linear mapping from 3D to 1D – no Z-indexing yet.)

  • Buffer L storing int2 pairs of starting and ending positions of particular spatial hashes in buffer Sp (one way to fill it is sketched right after this list). Example: L[12] = (int2)(0, 50).

    • L[12].x is the index (in Sp) of the first particle with spatial hash 12.
    • L[12].y is the index (in Sp) of the last particle with spatial hash 12.
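One way to fill L from the sorted Sp (a sketch, not necessarily what the CUDA Particles sample does): launch one work-item per entry of Sp and write a cell boundary wherever the hash changes. This assumes Sp[i].x holds the hash and Sp[i].y the particle id, and that L was cleared to an empty sentinel such as (0, -1) beforehand.

__kernel void build_cell_ranges(__global const int2* Sp,  // sorted by hash
                                __global int2*       L,
                                int                  num_particles) {
  int i = get_global_id(0);
  if (i >= num_particles) return;

  int h = Sp[i].x;                     // assumed layout: .x = hash, .y = particle id
  if (i == 0 || Sp[i - 1].x != h)
    L[h].x = i;                        // first entry of cell h in Sp
  if (i == num_particles - 1 || Sp[i + 1].x != h)
    L[h].y = i;                        // last entry of cell h in Sp (inclusive)
}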

Now that I have all these buffers, I want to iterate through all the particles in P and for each particle iterate through its nearest neighbors. Currently I have a kernel that looks like this (pseudocode):

__kernel void process_particles(__global const float3* P,
                                __global const int2*   Sp,
                                __global const int2*   L,
                                __global int*          Out) {
  size_t gid             = get_global_id(0);
  float3 curr_particle   = P[gid];
  int    processed_value = 0;

  for(int x=-1; x<=1; x++)
    for(int y=-1; y<=1; y++)
      for(int z=-1; z<=1; z++) {

        float3 neigh_position = curr_particle + (float3)(x,y,z)*GRID_CELL_SIDE;

        // ugly boundary checking
        if ( any(neigh_position < (float3)(0.0f)) ||
             any(neigh_position > (float3)(BOUNDARY)) )
             continue;

        int  neigh_hash       = spatial_hash( neigh_position );
        int2 particles_range  = L[ neigh_hash ];

        // L[...].y is the index of the last particle in the cell (inclusive)
        for(int p=particles_range.x; p<=particles_range.y; p++)
          processed_value += heavy_computation( P[ Sp[p].y ] );

      }

  Out[gid] = processed_value;
}

The problem with that code is that it's slow. I suspect the nonlinear GPU memory access (particularly P[Sp[p].y] in the innermost for loop) is causing the slowness.

What I want to do is use a Z-order curve as the spatial hash. That way I could iterate through a contiguous range of memory with just one for loop when querying the neighbors. The only problem is that I don't know what the start and stop Z-index values should be.

The holy grail I want to achieve:

__kernel void process_particles(__global const float3* P,
                                __global const int2*   Sp,
                                __global const int2*   L,
                                __global int*          Out) {
  size_t gid             = get_global_id(0);
  float3 curr_particle   = P[gid];
  int    processed_value = 0;

  // How to accomplish this??
  // `get_neighbors_range()` should return the start and end Z-index values
  // covering the range of near-neighbor cells
  int2 nearest_neighboring_cells_range = get_neighbors_range(curr_particle);
  int first_particle_id = L[ nearest_neighboring_cells_range.x ].x;
  int last_particle_id  = L[ nearest_neighboring_cells_range.y ].y;

  for(int p=first_particle_id; p<=last_particle_id; p++) {
      processed_value += heavy_computation( P[ Sp[p].y ] );
  }

  Out[gid] = processed_value;
}

1 Answer:

Answer 0 (score: -1):

You should take a close look at the Morton code algorithm. Ericson's Real-Time Collision Detection explains it very well.

Ericson - Real-Time Collision Detection

Here is another good explanation, including some tests:

Morton encoding/decoding through bit interleaving: Implementations

The Z-order curve only defines a path over the coordinates, along which you can hash 2D or 3D coordinates down to a single integer. The algorithm can descend one level deeper on every iteration, but you have to set the limit yourself. Usually the stopping index is marked by a sentinel, and where the sentinel stops tells you at which level a particle is placed. The maximum level you define therefore determines how many cells you get per dimension: with a maximum level of 6, for example, you have 2^6 = 64, so your system has 64x64x64 cells (in 3D). This also means you have to work with integer-based coordinates. If you use floats, you have to convert first, e.g. coord.x = 64 * float_x, and so on.
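A rough sketch of that hashing step, using the usual bit-interleaving trick for a 30-bit 3D Morton code (GRID_SIDE and BOUNDARY are placeholder constants here, and the function names are only illustrative):

// Spread the lower 10 bits of v so there are two zero bits between each
// original bit (standard expansion used for 30-bit 3D Morton codes).
uint expand_bits_10(uint v) {
  v &= 0x000003FFu;
  v = (v | (v << 16)) & 0x030000FFu;
  v = (v | (v <<  8)) & 0x0300F00Fu;
  v = (v | (v <<  4)) & 0x030C30C3u;
  v = (v | (v <<  2)) & 0x09249249u;
  return v;
}

// Morton code of integer cell coordinates in [0, 1023].
uint morton3d(uint cx, uint cy, uint cz) {
  return (expand_bits_10(cx) << 2) | (expand_bits_10(cy) << 1) | expand_bits_10(cz);
}

// Quantize a position in [0, BOUNDARY)^3 onto a GRID_SIDE^3 grid (GRID_SIDE a
// power of two, e.g. 64 for level 6) and hash it with the Morton code.
uint spatial_hash_morton(float3 pos) {
  uint cx = min((uint)(pos.x / BOUNDARY * GRID_SIDE), (uint)(GRID_SIDE - 1));
  uint cy = min((uint)(pos.y / BOUNDARY * GRID_SIDE), (uint)(GRID_SIDE - 1));
  uint cz = min((uint)(pos.z / BOUNDARY * GRID_SIDE), (uint)(GRID_SIDE - 1));
  return morton3d(cx, cy, cz);
}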

If you know how many cells your system has, you can define that limit. Have you tried using a binary octree?

Since the particles are in motion (as in that CUDA example), you should try to parallelize over the particles rather than over the cells.

If you want to build a nearest-neighbor list, you have to map particles to cells. This is done with a table that is then sorted by cell (cell → particles); a sketch of that step follows below. You should still iterate over the particles and access their neighbors.
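In your setup that table is the Sp buffer; the step that fills it could look roughly like this (one work-item per particle, reusing your placeholder spatial_hash(); the sort by hash itself is not shown):

__kernel void calc_hashes(__global const float3* P,
                          __global int2*         Sp,
                          int                    num_particles) {
  int i = get_global_id(0);
  if (i >= num_particles) return;
  // assumed layout: .x = spatial hash, .y = particle id; sort Sp by .x afterwards
  Sp[i] = (int2)(spatial_hash(P[i]), i);
}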

About your code:

  The problem with that code is that it's slow. I suspect the nonlinear GPU memory access (particularly P[Sp[p].y] in the innermost for loop) is causing the slowness.

Remember Donald Knuth: you should measure where the bottleneck actually is. You could use the NVCC Profiler and look for the bottleneck. I am not sure what OpenCL offers as a profiler.

    // ugly boundary checking
    if ( any(neigh_position < (float3)(0.0f)) ||
         any(neigh_position > (float3)(BOUNDARY)) )
         continue;

I don't think you should branch like this; how about returning zero when you call heavy_computation instead? Not sure, but you may be paying a branch penalty here. Try to remove it somehow.
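For example, the body of your triple loop could be rewritten without the continue so that out-of-domain cells simply contribute zero (just a sketch reusing your placeholder names):

        float3 neigh_position = curr_particle + (float3)(x, y, z) * GRID_CELL_SIDE;

        // 1 if the neighbor cell lies inside the domain, 0 otherwise (no `continue`).
        int inside = !(any(neigh_position < (float3)(0.0f)) ||
                       any(neigh_position > (float3)(BOUNDARY)));

        // Clamp so spatial_hash() always receives a valid position...
        float3 safe_position   = clamp(neigh_position, (float3)(0.0f), (float3)(BOUNDARY));
        int2   particles_range = L[spatial_hash(safe_position)];

        // ...and multiply by `inside` so out-of-domain cells contribute zero.
        for (int p = particles_range.x; p <= particles_range.y; p++)
          processed_value += inside * heavy_computation(P[Sp[p].y]);

Work-items next to the domain boundary then do a few wasted iterations over a clamped cell, but every work-item follows the same code path.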

Running in parallel over the cells only works if you have no write access to the particle data; otherwise you would have to use atomics. If you iterate over the particle range instead, you only have read access to the cells and their neighbors, but you build your sums in parallel and you are not forced into some race-condition-handling scheme.

  Also, what is the ideal ratio of the number of particles to the number of grid cells?

That really depends on your algorithm and on the particle packing in your domain, but in your case I would define the cell size to be equal to the particle diameter and simply use however many cells that gives you.

So if you want to use Z-order and achieve your holy grail, try using integer coordinates and hashing them.

Also try to use a larger number of particles. With something like the roughly 65,000 particles of the CUDA example, the parallelization is most efficient: the available processing units are kept busy (fewer idle threads).