K-means clustering acceleration on the GPU (CUDA)

Date: 2015-03-21 20:07:46

Tags: cuda parallel-processing gpgpu k-means nsight

I am a fairly new CUDA user. I am practicing on my first CUDA application, in which I try to accelerate the k-means algorithm using a GPU (GTX 670).

Briefly, each thread works on a single point, which is compared against all cluster centers, and the point is assigned to the center with the minimum distance (the commented kernel code can be seen below).

According to Nsight Visual Studio, I get 99.61% occupancy (1024 blocks, 1024 threads per block), 99.34% streaming-multiprocessor activity, 79.98% warp issue efficiency, no shared-memory bank conflicts, 18.4 GFLOPs of single-precision MUL and 55.2 GFLOPs of single-precision ADD (the k-means kernel takes about 14.5 ms to complete with the given parameters).

According to Wikipedia, the GTX 670's peak performance is 2460 GFLOPs. I am nowhere near it. On top of that, some papers claim they can reach more than half of peak performance. I cannot see how to optimize this kernel code any further. Are there any optimizations I can apply to the kernel? Any suggestion or help is appreciated, and I can provide any additional information on demand.

Complete Code

Thanks in advance.

#define SIZE 1024*1024 //number of points
#define CENTERS 32     //number of cluster centroids
#define DIM 8          //dimension of each point and center
#define cudaTHREADSIZE 1024 //threads per block
#define cudaBLOCKSIZE SIZE/cudaTHREADSIZE //number of blocks for kernel

__global__ void kMeans(float *dp, float *dc,int *tag, int *membershipChangedPerBlock)
{
    //TOTAL NUMBER OF THREADS SHOULD BE EQUAL TO THE NUMBER OF POINTS, BECAUSE EACH THREAD WORKS ON A SINGLE POINT
    __shared__ int membershipChanged[cudaTHREADSIZE]; //int, not unsigned char: the block-wide sum can reach 1024, which would overflow an 8-bit counter
    __shared__ float dc_shared[CENTERS*DIM];

    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int threadID = threadIdx.x;

    membershipChanged[threadIdx.x] = 0;
    //move the centers to shared memory, because each and every thread reads them (roughly +10% performance here)
    while(threadID < CENTERS*DIM){
        dc_shared[threadID] = dc[threadID];

        threadID += blockDim.x;
    }
    __syncthreads();

    while(tid < SIZE){
        int   index,prevIndex;
        float dist, min_dist;

        index = 0;//all initial point indices(centroid number) are assigned to 0.
        prevIndex = 0;
        dist = 0;
        min_dist = 0;

        //euclid distance for center 0
        for(int dimIdx = 0; dimIdx < DIM; dimIdx++){
            min_dist += (dp[tid + dimIdx*SIZE] - dc_shared[dimIdx*CENTERS])*(dp[tid + dimIdx*SIZE] - dc_shared[dimIdx*CENTERS]);
        }

        //euclid distance for other centers with distance comparison
        for(int centerIdx = 1; centerIdx < CENTERS; centerIdx++){
            dist = 0;
            for(int dimIdx = 0; dimIdx < DIM; dimIdx++){
                dist += (dp[tid + dimIdx*SIZE] - dc_shared[centerIdx + dimIdx*CENTERS])*(dp[tid + dimIdx*SIZE] - dc_shared[centerIdx + dimIdx*CENTERS]);    
            }   
            //compare distances, if found a shorter one, change index to that centroid number
            if(dist < min_dist){
                min_dist = dist;
                index = centerIdx;
            }
        }

        if (tag[tid] != index) {//if a point's cluster membership changes, flag it as changed in order to compute total membership changes later on
            membershipChanged[threadIdx.x] = 1;
        }
        tag[tid] = index;

        __syncthreads();//sync before applying sum reduction to membership changes


        //sum reduction
        for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s) {
                membershipChanged[threadIdx.x] +=
                    membershipChanged[threadIdx.x + s];
            }
            __syncthreads();
        }

        if (threadIdx.x == 0) {
            membershipChangedPerBlock[blockIdx.x] = membershipChanged[0];
        }
        tid += blockDim.x * gridDim.x;
    }
}
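
For reference, a minimal host-side driver for the kernel above might look like the sketch below. This is not part of the original post: the buffer names, the initialization step left as a comment, and the host-side convergence count are assumptions layered on top of the question's macros and kernel signature.

```cuda
// Hypothetical host-side driver for the kMeans kernel above (a sketch, not
// the original poster's code). Assumes the SIZE/CENTERS/DIM/cudaTHREADSIZE/
// cudaBLOCKSIZE macros and the kMeans kernel from the question are in scope.
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main()
{
    float *dp, *dc;
    int *tag, *changedPerBlock;
    cudaMalloc(&dp,  SIZE * DIM * sizeof(float));     // points, dim-major: dp[tid + dim*SIZE]
    cudaMalloc(&dc,  CENTERS * DIM * sizeof(float));  // centroids: dc[center + dim*CENTERS]
    cudaMalloc(&tag, SIZE * sizeof(int));             // current cluster assignment per point
    cudaMalloc(&changedPerBlock, cudaBLOCKSIZE * sizeof(int));

    // ... copy real point/centroid data into dp and dc with cudaMemcpy,
    //     and zero-initialize tag, before launching ...

    kMeans<<<cudaBLOCKSIZE, cudaTHREADSIZE>>>(dp, dc, tag, changedPerBlock);
    cudaDeviceSynchronize();

    // Sum the per-block change counters on the host; a full k-means loop
    // would recompute centroids and relaunch until this total reaches 0.
    std::vector<int> h_changed(cudaBLOCKSIZE);
    cudaMemcpy(h_changed.data(), changedPerBlock,
               cudaBLOCKSIZE * sizeof(int), cudaMemcpyDeviceToHost);
    long total = 0;
    for (int c : h_changed) total += c;
    printf("memberships changed: %ld\n", total);

    cudaFree(dp); cudaFree(dc); cudaFree(tag); cudaFree(changedPerBlock);
    return 0;
}
```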

1 Answer:

Answer 0 (score: 3)

My suggestion is to compare your work with that of more experienced GPU developers. I found a k-means implementation written by Bryan Catanzaro, after watching this video. You can find the source code at:

https://github.com/bryancatanzaro/kmeans

I am also a beginner, but IMHO it is better to use a library like Thrust. GPU programming is a really complicated issue, and it is hard to achieve maximum performance; Thrust will help you with that.
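
To illustrate the Thrust suggestion: the per-block change counters produced by the question's kernel could be summed with a single library call instead of a hand-written host loop or reduction kernel. The function name below is hypothetical; the buffer follows the question's kernel signature.

```cuda
// Sketch: summing the per-block membership-change counters with Thrust.
// Illustrative only -- this is not Bryan Catanzaro's code.
#include <thrust/device_ptr.h>
#include <thrust/reduce.h>

// Returns the total number of points whose cluster membership changed,
// given the membershipChangedPerBlock array filled by the kMeans kernel.
int countMembershipChanges(int *membershipChangedPerBlock, int numBlocks)
{
    thrust::device_ptr<int> begin(membershipChangedPerBlock);
    // thrust::reduce runs on the device and returns the sum to the host.
    return thrust::reduce(begin, begin + numBlocks, 0);
}
```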