I am a fairly new CUDA user. I am practicing on my first CUDA application, where I try to accelerate the k-means algorithm using a GPU (GTX 670).
Briefly, each thread works on a single point, which is compared against all cluster centroids; the point is assigned to the centroid with the minimum distance (the kernel code, with comments, can be seen below).
According to Nsight Visual Studio, I get 99.61% occupancy (1024 blocks, 1024 threads per block), 99.34% streaming-multiprocessor activity, 79.98% warp issue efficiency, no shared memory bank conflicts, 18.4 GFLOPs single-precision MUL and 55.2 GFLOPs single-precision ADD (the k-means kernel takes about 14.5 ms to complete with the given parameters).
According to Wikipedia, the GTX 670's peak performance is 2460 GFLOPs. I am nowhere near it. On top of that, some papers claim they can reach more than half of peak performance, yet I cannot see how to optimize this kernel any further. Is there any optimization I can still apply to the kernel? Any suggestion or help is appreciated, and I can provide any additional information if needed.
Thanks in advance.
#define SIZE 1024*1024                    // number of points
#define CENTERS 32                        // number of cluster centroids
#define DIM 8                             // dimension of each point and centroid
#define cudaTHREADSIZE 1024               // threads per block
#define cudaBLOCKSIZE SIZE/cudaTHREADSIZE // number of blocks for the kernel

__global__ void kMeans(float *dp, float *dc, int *tag, int *membershipChangedPerBlock)
{
    // TOTAL NUMBER OF THREADS SHOULD BE EQUAL TO THE NUMBER OF POINTS,
    // BECAUSE EACH THREAD WORKS ON A SINGLE POINT
    __shared__ unsigned char membershipChanged[cudaTHREADSIZE];
    __shared__ float dc_shared[CENTERS*DIM];

    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int threadID = threadIdx.x;

    membershipChanged[threadIdx.x] = 0;

    // Stage the centroids in shared memory, because each and every thread
    // reads them (roughly +10% performance here).
    while (threadID < CENTERS*DIM) {
        dc_shared[threadID] = dc[threadID];
        threadID += blockDim.x;
    }
    __syncthreads();

    while (tid < SIZE) {
        int index, prevIndex;
        float dist, min_dist;

        index = 0;      // all points are initially assigned to centroid 0
        prevIndex = 0;
        dist = 0;
        min_dist = 0;

        // squared Euclidean distance to centroid 0
        for (int dimIdx = 0; dimIdx < DIM; dimIdx++) {
            min_dist += (dp[tid + dimIdx*SIZE] - dc_shared[dimIdx*CENTERS])
                      * (dp[tid + dimIdx*SIZE] - dc_shared[dimIdx*CENTERS]);
        }

        // distances to the remaining centroids, keeping the running minimum
        for (int centerIdx = 1; centerIdx < CENTERS; centerIdx++) {
            dist = 0;
            for (int dimIdx = 0; dimIdx < DIM; dimIdx++) {
                dist += (dp[tid + dimIdx*SIZE] - dc_shared[centerIdx + dimIdx*CENTERS])
                      * (dp[tid + dimIdx*SIZE] - dc_shared[centerIdx + dimIdx*CENTERS]);
            }
            // if a shorter distance is found, reassign the point to that centroid
            if (dist < min_dist) {
                min_dist = dist;
                index = centerIdx;
            }
        }

        if (tag[tid] != index) {
            // if a point's cluster membership changes, flag it as changed
            // so the total number of membership changes can be computed later
            membershipChanged[threadIdx.x] = 1;
        }
        tag[tid] = index;

        __syncthreads(); // sync before the sum reduction over membership changes

        // sum reduction
        for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s) {
                membershipChanged[threadIdx.x] += membershipChanged[threadIdx.x + s];
            }
            __syncthreads();
        }
        if (threadIdx.x == 0) {
            membershipChangedPerBlock[blockIdx.x] = membershipChanged[0];
        }

        tid += blockDim.x * gridDim.x;
    }
}
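[Editor's note] One detail worth noting about the kernel above: each coordinate `dp[tid + dimIdx*SIZE]` is loaded from global memory twice per squared term, and again for every one of the `CENTERS` centroids. A hedged sketch of the assignment step with the point cached in registers instead (this is not the author's code; names like `assignPoints` are illustrative, and the membership-change reduction is omitted for brevity):

```cuda
#include <float.h>

// Sketch of a register-cached variant of the assignment step. Same data
// layout as in the question: coordinate d of point i lives at dp[i + d*SIZE],
// coordinate d of centroid c at dc[c + d*CENTERS].
#define SIZE (1024*1024)
#define CENTERS 32
#define DIM 8

__global__ void assignPoints(const float *dp, const float *dc, int *tag)
{
    __shared__ float dc_shared[CENTERS*DIM];
    for (int i = threadIdx.x; i < CENTERS*DIM; i += blockDim.x)
        dc_shared[i] = dc[i];
    __syncthreads();

    for (int tid = threadIdx.x + blockIdx.x * blockDim.x;
         tid < SIZE; tid += blockDim.x * gridDim.x) {
        float pt[DIM];                      // point cached in registers:
        for (int d = 0; d < DIM; d++)       // one global load per coordinate
            pt[d] = dp[tid + d*SIZE];       // instead of ~2*CENTERS loads

        int best = 0;
        float min_dist = FLT_MAX;           // first centroid always wins the
        for (int c = 0; c < CENTERS; c++) { // first comparison, so no special
            float dist = 0.0f;              // case for centroid 0 is needed
            for (int d = 0; d < DIM; d++) {
                float diff = pt[d] - dc_shared[c + d*CENTERS];
                dist += diff * diff;        // squared Euclidean distance
            }
            if (dist < min_dist) { min_dist = dist; best = c; }
        }
        tag[tid] = best;
    }
}
```

Whether this actually helps depends on how well the L1/texture cache already serves the repeated loads on this architecture, so it is only a candidate to profile, not a guaranteed win.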
Answer 0 (score: 3)
My suggestion is to compare your work with that of more experienced GPU developers. I found a k-means implementation written by Bryan Catanzaro after watching his video. You can find the source code here:
https://github.com/bryancatanzaro/kmeans
I am also a beginner, but IMHO it is better to use a library like Thrust. GPU programming is a genuinely complicated problem, and it is hard to achieve maximum performance; Thrust will help you with that.
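[Editor's note] As one concrete illustration of what the Thrust suggestion buys you: the hand-written shared-memory reduction in the question (summing the per-point membership-change flags) can be replaced by a single `thrust::reduce` call. A minimal sketch, assuming the assignment kernel writes one 0/1 flag per point into a device vector (`countChanged` and `changed` are illustrative names, not from the original code):

```cuda
#include <thrust/device_vector.h>
#include <thrust/reduce.h>

// Total number of membership changes across all points, computed with a
// library reduction instead of a hand-written shared-memory one.
int countChanged(const thrust::device_vector<int>& changed)
{
    // thrust::reduce launches a tuned reduction kernel under the hood
    return thrust::reduce(changed.begin(), changed.end(), 0);
}
```

The point is not that Thrust is always faster, but that primitives like reductions are easy to get subtly wrong and are already well optimized in the library.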