Question

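Summary:

Any ideas on how to further improve a basic scatter operation in CUDA? In particular, what if one knows it will only ever be used to compact a larger array into a smaller one? Or why did the vectorized memory operations and the shared-memory approach below not work? I feel there may be something fundamental I am missing, and any help would be appreciated.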

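EDIT 03/09/15: So I found the Parallel For All Blog post "Optimized Filtering with Warp-Aggregated Atomics". I had assumed atomics would be intrinsically slower for this purpose, but I was wrong, especially since I don't think I care about maintaining element order in the array during my simulation. I'll have to think about it some more and then implement it to see what happens!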

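EDIT 01/04/16: I realized I never wrote up my results. Unfortunately, in that Parallel For All Blog post they compared the global-atomic method for compaction against the Thrust prefix-sum compact method, which is actually quite slow. CUB's Device::IF (`cub::DeviceSelect::If`) is much faster than Thrust's, as is the prefix-sum version I wrote using CUB's Device::Scan (`cub::DeviceScan`) plus custom code. The warp-aggregated global-atomic method is still about 5-10% faster, but nowhere near the 3-4x I had been hoping for based on the results in the blog. I'm still using the prefix-sum method: although maintaining element order is not necessary, I prefer the consistency of the prefix-sum results, and the advantage from the atomics is not very big. I still try various ways to improve compaction, but so far I've found only marginal improvements (2% at best) at the cost of dramatically increased code complexity.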
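
For readers who have not seen the technique mentioned in the edits above, here is a minimal sketch of a warp-aggregated atomic filter written with cooperative groups. It is not the code benchmarked above: the predicate (keep values greater than zero) and the names `atomicAggInc`, `filter_positive` and `filter_positive_host` are assumptions chosen for illustration, and the output order is not guaranteed to match the input order.

```cuda
#include <cassert>
#include <vector>
#include <cooperative_groups.h>
#include <cuda_runtime.h>

namespace cg = cooperative_groups;

// One atomicAdd per group of currently active threads instead of one per thread.
// Returns a unique output slot for the calling thread.
__device__ int atomicAggInc(int* counter) {
    cg::coalesced_group g = cg::coalesced_threads();
    int base = 0;
    if (g.thread_rank() == 0) {
        base = atomicAdd(counter, static_cast<int>(g.size()));
    }
    // Broadcast the leader's base offset, then add this thread's rank in the group.
    return g.shfl(base, 0) + static_cast<int>(g.thread_rank());
}

// Keeps elements > 0 (assumed predicate); order of dst is not deterministic.
__global__ void filter_positive(float* dst, int* n_kept, const float* src, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) {
        if (src[i] > 0.0f) {
            dst[atomicAggInc(n_kept)] = src[i];
        }
    }
}

// Hypothetical host wrapper used only to make the sketch runnable.
std::vector<float> filter_positive_host(const std::vector<float>& h_src) {
    const int n = static_cast<int>(h_src.size());
    if (n == 0) return {};
    float *src = nullptr, *dst = nullptr;
    int* n_kept = nullptr;
    cudaMalloc(&src, n * sizeof(float));
    cudaMalloc(&dst, n * sizeof(float));
    cudaMalloc(&n_kept, sizeof(int));
    cudaMemcpy(src, h_src.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(n_kept, 0, sizeof(int));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    filter_positive<<<blocks, threads>>>(dst, n_kept, src, n);

    int kept = 0;
    cudaError_t err = cudaMemcpy(&kept, n_kept, sizeof(int), cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    (void)err;
    std::vector<float> out(kept);
    cudaMemcpy(out.data(), dst, kept * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(src);
    cudaFree(dst);
    cudaFree(n_kept);
    return out;
}
```

Because the slot each thread receives depends on which group reaches the atomic first, two runs can produce the kept elements in different orders, which is the consistency trade-off described above.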

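Details:

I am writing a simulation in CUDA in which, every 40-60 time steps, I compact out the elements I am no longer interested in simulating. From profiling, the scatter operation takes up most of the time when compacting, more than the filter kernel or the prefix sum. Right now I use a very basic scatter function: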

    __global__ void scatter_arrays(float * new_freq, const float * const freq, const int * const flag, const int * const scan_Index, const int freq_Index){
        int myID = blockIdx.x*blockDim.x + threadIdx.x;
        for(int id = myID; id < freq_Index; id += blockDim.x*gridDim.x){
            if(flag[id]){
                new_freq[scan_Index[id]] = freq[id];
            }
        }
    }

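freq_Index is the number of elements in the old array. The flag array is the result of the filter. scan_Index is the result of the (exclusive) prefix sum over the flag array, so scan_Index[id] is the output position of element id.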
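
To make the roles of the three arrays concrete, here is a minimal end-to-end sketch of the filter, scan and scatter sequence. It assumes a "keep values greater than zero" predicate and uses Thrust's `exclusive_scan` for brevity (a CUB scan could be swapped in); the names `flag_positive`, `scatter` and `compact_positive` are hypothetical and are not part of my simulation code.

```cuda
#include <cassert>
#include <vector>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/scan.h>

// Filter step: flag[id] = 1 if element id is kept (assumed predicate: freq > 0).
__global__ void flag_positive(int* flag, const float* freq, int n) {
    for (int id = blockIdx.x * blockDim.x + threadIdx.x; id < n; id += blockDim.x * gridDim.x) {
        flag[id] = freq[id] > 0.0f ? 1 : 0;
    }
}

// Scatter step: same logic as the basic scatter kernel in the question.
__global__ void scatter(float* new_freq, const float* freq, const int* flag,
                        const int* scan_Index, int n) {
    for (int id = blockIdx.x * blockDim.x + threadIdx.x; id < n; id += blockDim.x * gridDim.x) {
        if (flag[id]) {
            new_freq[scan_Index[id]] = freq[id];
        }
    }
}

// Hypothetical host wrapper: returns the kept elements in their original order.
std::vector<float> compact_positive(const std::vector<float>& h_freq) {
    const int n = static_cast<int>(h_freq.size());
    if (n == 0) return {};
    thrust::device_vector<float> freq(h_freq.begin(), h_freq.end());
    thrust::device_vector<int> flag(n), scan_Index(n);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    flag_positive<<<blocks, threads>>>(thrust::raw_pointer_cast(flag.data()),
                                       thrust::raw_pointer_cast(freq.data()), n);

    // Exclusive scan: scan_Index[id] = number of kept elements before id.
    thrust::exclusive_scan(flag.begin(), flag.end(), scan_Index.begin());

    // Size of the compacted array = last scan value + last flag.
    const int last_scan = scan_Index[n - 1];
    const int last_flag = flag[n - 1];
    const int kept = last_scan + last_flag;

    thrust::device_vector<float> new_freq(kept);
    scatter<<<blocks, threads>>>(thrust::raw_pointer_cast(new_freq.data()),
                                 thrust::raw_pointer_cast(freq.data()),
                                 thrust::raw_pointer_cast(flag.data()),
                                 thrust::raw_pointer_cast(scan_Index.data()), n);
    cudaError_t err = cudaDeviceSynchronize();
    assert(err == cudaSuccess);
    (void)err;

    std::vector<float> out(kept);
    thrust::copy(new_freq.begin(), new_freq.end(), out.begin());
    return out;
}
```

For the input 1, -2, 3, 0, 5 the flags are 1, 0, 1, 0, 1 and the exclusive scan is 0, 1, 1, 2, 2, so the three kept values land at positions 0, 1 and 2.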

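My attempts to improve this were, first, to read the flagged frequencies into shared memory and then write from shared memory to global memory. The idea was that the writes to global memory would be better coalesced within a warp (e.g. instead of thread 0 writing to position 0 and thread 128 writing to position 1, thread 0 would write to 0 and thread 1 would write to 1). I also tried vectorizing the reads and writes: instead of reading and writing floats/ints, I read/wrote float4/int4 from the global arrays where possible, so four numbers at a time. I thought this might speed up the scatter by issuing fewer memory operations that each transfer more data. The "kitchen sink" code with both vectorized memory loads/stores and shared memory is below: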

    const int compact_threads = 256;
    __global__ void scatter_arrays2(float * new_freq, const float * const freq, const int * const flag, const int * const scan_Index, const int freq_Index){
        int gID =  blockIdx.x*blockDim.x + threadIdx.x; //global ID
        int tID = threadIdx.x; //thread ID within block
        __shared__ float row[4*compact_threads];
        __shared__ int start_index[1];
        __shared__ int end_index[1];
        float4 myResult;
        int st_index;
        int4 myFlag;
        int4 index;
        for(int id = gID; id < freq_Index/4; id+= blockDim.x*gridDim.x){
            if(tID == 0){
                index = reinterpret_cast<const int4*>(scan_Index)[id];
                myFlag = reinterpret_cast<const int4*>(flag)[id];
                start_index[0] = index.x;
                st_index = index.x;
                myResult = reinterpret_cast<const float4*>(freq)[id];
                if(myFlag.x){ row[0] = myResult.x; }
                if(myFlag.y){ row[index.y-st_index] = myResult.y; }
                if(myFlag.z){ row[index.z-st_index] = myResult.z; }
                if(myFlag.w){ row[index.w-st_index] = myResult.w; }
            }
            __syncthreads();
            if(tID > 0){
                myFlag = reinterpret_cast<const int4*>(flag)[id];
                st_index = start_index[0];
                index = reinterpret_cast<const int4*>(scan_Index)[id];
                myResult = reinterpret_cast<const float4*>(freq)[id];
                if(myFlag.x){ row[index.x-st_index] = myResult.x; }
                if(myFlag.y){ row[index.y-st_index] = myResult.y; }
                if(myFlag.z){ row[index.z-st_index] = myResult.z; }
                if(myFlag.w){ row[index.w-st_index] = myResult.w; }
                if(tID == blockDim.x - 1 || id == freq_Index/4 - 1){ end_index[0] = index.w + myFlag.w; } //last thread of the block, or the thread holding the last float4
            }
            __syncthreads();
            int count = end_index[0] - st_index;

            int rem = st_index & 0x3; //equivalent to modulo 4
            int offset = 0;
            if(rem){ offset = 4 - rem; }

            if(tID < offset && tID < count){
                new_freq[st_index+tID] = row[tID]; //unaligned head, written one float at a time
            }

            int tempID = 4*tID+offset;
            if((tempID+3) < count){
                reinterpret_cast<float4*>(new_freq)[(st_index+offset)/4 + tID] = make_float4(row[tempID],row[tempID+1],row[tempID+2],row[tempID+3]); //st_index+offset is a multiple of 4
            }

            tempID = tID + offset + (count-offset)/4*4;
            if(tempID < count){ new_freq[st_index+tempID] = row[tempID]; }
        }
        int id = gID + freq_Index/4 * 4; 
        if(id < freq_Index){
            if(flag[id]){
                new_freq[scan_Index[id]] = freq[id];
            }
        }
    }

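Obviously it gets a bit more complicated. :) While the above kernel seems stable when there are hundreds of thousands of elements in the array, I've noticed a race condition when the array size reaches the tens of millions. I'm still trying to track the bug down.

But regardless, neither method (shared memory or vectorization), together or alone, improved performance. I was especially surprised by the lack of benefit from vectorizing the memory operations. It had helped in other functions I had written, though now I wonder whether it helped there because it increased instruction-level parallelism in the calculation steps of those functions rather than because of fewer memory operations.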
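
For comparison, this is roughly what vectorized access looks like in isolation, in a plain copy where both the source and the destination are contiguous. It is only a sketch: `copy_vec4` and `copy_on_device` are hypothetical names, and it relies on `cudaMalloc` returning pointers that are suitably aligned for `float4`. In the scatter, the destination positions depend on the data, which is why the kernel above has to stage values through shared memory before it can issue any `float4` stores.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Copies n floats, four at a time where possible, with a scalar tail.
// Assumes src and dst are 16-byte aligned (true for cudaMalloc'd pointers).
__global__ void copy_vec4(float* dst, const float* src, int n) {
    const int tid = blockIdx.x * blockDim.x + threadIdx.x;
    const int stride = blockDim.x * gridDim.x;
    const float4* src4 = reinterpret_cast<const float4*>(src);
    float4* dst4 = reinterpret_cast<float4*>(dst);

    // Vectorized body: one 16-byte load and one 16-byte store per iteration.
    for (int i = tid; i < n / 4; i += stride) {
        dst4[i] = src4[i];
    }
    // Scalar tail for the last n % 4 elements.
    for (int i = (n / 4) * 4 + tid; i < n; i += stride) {
        dst[i] = src[i];
    }
}

// Hypothetical host wrapper used only to exercise the kernel.
std::vector<float> copy_on_device(const std::vector<float>& h_src) {
    const int n = static_cast<int>(h_src.size());
    if (n == 0) return {};
    float *src = nullptr, *dst = nullptr;
    cudaMalloc(&src, n * sizeof(float));
    cudaMalloc(&dst, n * sizeof(float));
    cudaMemcpy(src, h_src.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n / 4 + threads) / threads;  // at least one block
    copy_vec4<<<blocks, threads>>>(dst, src, n);

    std::vector<float> out(n);
    cudaError_t err = cudaMemcpy(out.data(), dst, n * sizeof(float), cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    (void)err;
    cudaFree(src);
    cudaFree(dst);
    return out;
}
```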

Answer 1

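I found the algorithm described in this poster (a similar algorithm is also discussed in this paper) to be quite effective, especially for compacting large arrays. It uses less memory and is slightly faster than my previous method (5-10%). I made a few tweaks to the poster's algorithm: 1) I eliminated the final warp-shuffle reduction in phase 1, since one can simply sum the counts as they are computed; 2) I made the function work on arrays whose size is not a multiple of 1024 and added grid-stride loops; and 3) I let every thread load its registers simultaneously in phase 3 instead of one at a time. I also use CUB instead of Thrust for the inclusive sum, for faster scans. There may be more tweaks that could be made, but for now this is good.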

    //kernel phase 1
    //assumes blockDim.x is a multiple of 32, so whole warps enter and leave the loop together
    int myID = blockIdx.x*blockDim.x + threadIdx.x;
    //padded_length is nearest multiple of 1024 > true_length
    for(int id = myID; id < (padded_length >> 5); id += blockDim.x*gridDim.x){
        int lnID = threadIdx.x % warp_size; //lane within the warp
        int warpID = id >> 5; //global warp index; each warp handles 1024 elements

        unsigned int mask;
        unsigned int cnt = 0;

        for(int j = 0; j < 32; j++){
            int index = (warpID<<10)+(j<<5)+lnID;

            bool pred;
            if(index >= true_length) pred = false; //padding elements never pass the filter
            else pred = predicate(input[index]);
            mask = __ballot_sync(0xffffffff, pred); //__ballot(pred) before CUDA 9

            if(lnID == 0) {
                flag[(warpID<<5)+j] = mask;
                cnt += __popc(mask);
            }
        }

        if(lnID == 0) counter[warpID] = cnt; //store sum
    }

    //kernel phase 2 -> CUB Inclusive sum transforms counter array to scan_Index array

    //kernel phase 3
    int myID = blockIdx.x*blockDim.x + threadIdx.x;

    for(int id = myID; id < (padded_length >> 5); id += blockDim.x*gridDim.x){
        int lnID = threadIdx.x % warp_size;
        int warpID = id >> 5;

        unsigned int predmask;
        unsigned int cnt;

        predmask = flag[(warpID<<5)+lnID]; //each lane loads one 32-bit mask
        cnt = __popc(predmask);

        //parallel prefix sum
    #pragma unroll
        for(int offset = 1; offset < 32; offset <<= 1){
            unsigned int n = __shfl_up_sync(0xffffffff, cnt, offset); //__shfl_up before CUDA 9
            if(lnID >= offset) cnt += n;
        }

        unsigned int global_index = 0;
        if(warpID > 0) global_index = scan_Index[warpID - 1];

        for(int i = 0; i < 32; i++){
            unsigned int mask = __shfl_sync(0xffffffff, predmask, i); //broadcast from thread i
            unsigned int sub_group_index = 0;
            if(i > 0) sub_group_index = __shfl_sync(0xffffffff, cnt, i-1);
            if(mask & (1u << lnID)){
                compacted_array[global_index + sub_group_index + __popc(mask & ((1u << lnID) - 1))] = input[(warpID<<10)+(i<<5)+lnID];
            }
        }
    }
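
Phase 2 appears above only as a comment, so here is a minimal sketch of what that step could look like with `cub::DeviceScan::InclusiveSum`. The host helper `inclusive_sum_on_device` and the use of `int` for the per-warp counts are assumptions made for illustration; in the real code the `counter` and `scan_Index` arrays would already live on the device.

```cuda
#include <cassert>
#include <vector>
#include <cub/cub.cuh>
#include <cuda_runtime.h>

// Hypothetical host helper: inclusive prefix sum of the per-warp counts.
// After the scan, scan_Index[w] holds the number of kept elements in warps 0..w.
std::vector<int> inclusive_sum_on_device(const std::vector<int>& h_counter) {
    const int n = static_cast<int>(h_counter.size());
    if (n == 0) return {};
    int *counter = nullptr, *scan_Index = nullptr;
    cudaMalloc(&counter, n * sizeof(int));
    cudaMalloc(&scan_Index, n * sizeof(int));
    cudaMemcpy(counter, h_counter.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    // First call with a null workspace only reports the required temp_bytes.
    void* temp = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceScan::InclusiveSum(temp, temp_bytes, counter, scan_Index, n);
    cudaMalloc(&temp, temp_bytes);
    // Second call performs the scan.
    cub::DeviceScan::InclusiveSum(temp, temp_bytes, counter, scan_Index, n);

    std::vector<int> out(n);
    cudaError_t err = cudaMemcpy(out.data(), scan_Index, n * sizeof(int), cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    (void)err;
    cudaFree(temp);
    cudaFree(counter);
    cudaFree(scan_Index);
    return out;
}
```

Since the scan is inclusive, phase 3 reads `scan_Index[warpID - 1]` to get the starting offset for a warp, and the last entry gives the total size of the compacted array.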

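EDIT: There is a newer article by a subset of the poster's authors in which they examine a variation of compact that is faster than what is written above. However, their new version is not order-preserving, so it is not useful to me and I have not implemented it to test it. That said, if your project does not rely on element order, their newer compact version can probably speed up your algorithm.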

Improving compaction/scatter efficiency in CUDA
