Question

你好，我的问题是任何建议都会被接受：

我有结构数组（代表粒子），但为了简化，我的数组在开始时只包含 True 值（Particle.exist = True）。我在这个数组上运行我自己的 CUDA 内核函数，在某些情况下，True 值更改为 False。之后，我必须将此值移动到数组的末尾以获得更好的优化（不再使用死粒子（存在 = False））。

理论上我有两种方法可以做到这一点...

一些并行排序算法或
将死粒子移到末尾并移动数组。

第二个选项应该是更好的选择，但我不知道如何并行执行此操作。我可以有 1 000 000 个粒子，所以在一个线程中移动不是一个好主意...

这是我的代码示例。我把 Todo 放在需要移位数组的部分

struct Particle
{
    float2 position;
    float angle; 
    bool exists;
};

__global__ void moveParticles(Particle* particles, const unsigned int lengthOfParticles, const Particle* leaders, const unsigned int lengthOfLeaders, const unsigned int sizeOfLeader, const float speedFactor, const cudaTextureObject_t heightMapTexture)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned int skip = gridDim.x * blockDim.x;

    while (idx < lengthOfParticles)
    {
        // If particle does not exist then do nothing and skip
        if (!particles[idx].exists) { idx += skip; continue; }

        float bestLength = 3.40282e+038;
        unsigned int bestLeaderIndex;
        for (unsigned int i = 0; i < lengthOfLeaders; i++)
        {
            float currentLength = (
                (particles[idx].position.x - leaders[i].position.x) * (particles[idx].position.x - leaders[i].position.x)
                ) + (
                    (particles[idx].position.y - leaders[i].position.y) * (particles[idx].position.y - leaders[i].position.y)
                    );

            if (currentLength < bestLength)
            {
                bestLength = currentLength;
                bestLeaderIndex = i;
            }
        }

        Particle bestLeader = leaders[bestLeaderIndex];
        float differenceX = bestLeader.position.x - particles[idx].position.x;
        float differenceY = bestLeader.position.y - particles[idx].position.y;
        float newLength = sqrtf(differenceX * differenceX + differenceY * differenceY);

        // If the newLenght is equal to zero, then the particle is at the same position as leader
        // TODO: HERE I NEED SORT NOT EXISTING PARTICLE TO THE END
        if (newLength <= sizeOfLeader / 2) { particles[idx].exists = false; idx += skip; continue; }

        // Current height at the position
        const uchar4 texelOfHeight = tex2D<uchar4>(heightMapTexture, particles[idx].position.x, particles[idx].position.y);

        // Normalize vector
        differenceX /= newLength;
        differenceY /= newLength;

        int nextPositionOnMapX = round(particles[idx].position.x + differenceX);
        int nextPositionOnMapY = round(particles[idx].position.y + differenceY);

        // Height of the next position
        const uchar4 texelOfNextPosition = tex2D<uchar4>(heightMapTexture, nextPositionOnMapX, nextPositionOnMapY);

        float differenceHeight = texelOfHeight.x - texelOfNextPosition.x;

        float speed = sqrtf(speedFactor + 2 * fabsf(differenceHeight));

        // Multiply by speed
        differenceX *= speed;
        differenceY *= speed;

        particles[idx].position.x += differenceX;
        particles[idx].position.y += differenceY;

        idx += skip;
    }
}

我正在考虑的一种可能的解决方案是做自己的核函数，它只会移动粒子。像这样

__global__ void shiftParticles(const Particle* particles, const unsigned int lengthOfParticles, const unsigned int sizeOfParticle) {
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned int skip = gridDim.x * blockDim.x;
    //TODO: Shifting...
}

Answer 1

在 GPU 上排序效率很低，因此最好选择值以保留并基于它们执行分区。要轻松做到这一点，您可以使用非常高效的 CUB（因为它通常实现最佳或最先进的算法）。您可以使用 DevicePartition 或两个 DeviceSelect（前者可能会更快，除非您根本不想保留死粒子）。如果您想执行一些高级调整/优化，也可以使用块原语。

如果您出于某种原因仍想自己执行此操作（例如，减少项目中的依赖项数量），那么您可以在相对较新的设备上使用原子添加，因为它们非常好通过硬件优化。在旧设备上，您可以使用扫描来做到这一点，但它很难实现。问题是当有很多 SM 时，原子不会特别扩展，因此您需要执行一些高级阻塞策略。这是一个未经测试的简单实现来理解这个想法：

// PS: what is the difference between sizeOfParticle and lengthOfParticles?
// pos must be initialized to 0 and contains the number of living particles (pivot) once the kernel finished its execution.
__global__ void shiftParticles(const Particle* particles, const unsigned int lengthOfParticles, const unsigned int sizeOfParticle, Particle* outParticles, int* pos) {
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned int skip = gridDim.x * blockDim.x;
    const bool exists = particles[idx].exists;
    const int localPos = atomicAdd(pos, exists); // Here is the important point
    const Particle current = particles[idx];

    // outParticles is a needed temporary array or output one
    // as the operation cannot be efficiently performed in parallel.
    // It should likely be allocated and provided in argument to the kernel
    if(exists)
    {
        // Move the current particle to the beginning
        outParticles[localPos] = current;
    }
    else
    {
        // Move the current particle to the end
        outParticles[lengthOfParticles-1-idx+localPos] = current;
    }
}

请注意，由于原子操作，排序没有保留。如果您需要保持粒子的顺序，那么它会变得更加复杂，尤其是在 GPU 上，因为它会使算法更加连续。在这种情况下，一个简单的解决方案可能是使用稳定的排序。另一种解决方案是使用全局 scan 后跟一个间接存储值（因此需要两次传递）。实施有效的扫描有点复杂/乏味。希望 CUB 可以在这种情况下通过其 DeviceScan 原语提供很多帮助。

最后请注意，使用结构数组效率不高，尤其是在使用 SIMD 指令（如 GPU）的硬件上。使用数组结构（由于缓存行、合并、访问模式的连续性等），实现速度应该明显更快。

Cuda 将数组中的元素移动到末尾

1 个答案: