Question

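I am trying to compute the distance (weighted by a metric) between all points. To get a speedup, I am doing this on the GPU through CUDA and numba, since I find it more readable and easier to use.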

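I have two 1D arrays of 1D points and want to compute the distance between all points within the same array, as well as the distance between all points across the two arrays. I have written two CUDA kernels. One uses only global memory, and I have verified against CPU code that it gives the correct answer. Here it is: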

@cuda.jit
def gpuSameSample(A,arrSum):
    tx = cuda.blockDim.x*cuda.blockIdx.x + cuda.threadIdx.x
    temp = A[tx]
    tempSum = 0.0
    for i in range(tx+1,A.size):
        distance = (temp - A[i])**2
        tempSum +=  math.exp(-distance/sigma**2)
    arrSum[tx] = tempSum
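
For reference, the quantity this kernel computes can be written as a plain CPU loop. This is only a sketch: the name cpu_same_sample and the default sigma=1.0 are assumptions made for illustration, not the asker's actual verification code.

```python
import math

def cpu_same_sample(A, sigma=1.0):
    """CPU reference: out[i] = sum over j > i of exp(-(A[i] - A[j])**2 / sigma**2)."""
    n = len(A)
    out = [0.0] * n
    for i in range(n):
        # Same pair selection as the kernel: only elements after position i.
        for j in range(i + 1, n):
            distance = (A[i] - A[j]) ** 2
            out[i] += math.exp(-distance / sigma ** 2)
    return out

print(cpu_same_sample([0.0, 1.0, 2.0]))
```

Comparing a GPU result against a loop like this, with a small tolerance for float32 rounding, is one way to check a kernel.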

I am now trying to optimize this further by using shared memory. This is what I have so far:

@cuda.jit
def gpuSharedSameSample(A,arrSum):
    #my block size is equal to 32
    sA = cuda.shared.array(shape=(tpb),dtype=float32)
    bpg = cuda.gridDim.x
    tx = cuda.threadIdx.x + cuda.blockIdx.x *cuda.blockDim.x
    count = len(A)
    #loop through block by block
    tempSum = 0.0
    #myPoint = A[tx]

    if(tx < count):
        myPoint = A[tx]
        for currentBlock in range(bpg):

            #load in a block to shared memory
            copyIdx = (cuda.threadIdx.x + currentBlock*cuda.blockDim.x)
            if(copyIdx < count):
                sA[cuda.threadIdx.x] = A[copyIdx]
            #syncthreads to ensure copying finishes first
            cuda.syncthreads()


            if((tx < count)):
                for i in range(cuda.threadIdx.x,cuda.blockDim.x):
                    if(copyIdx != tx):
                        distance = (myPoint - sA[i])**2
                        tempSum += math.exp(-distance/sigma**2)

            #syncthreads here to avoid race conditions if a thread finishes earlier
            #arrSum[tx] += tempSum
            cuda.syncthreads()
    arrSum[tx] += tempSum

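I believe I have been careful about syncing threads, but this kernel gives an answer that is always too large (by about 5%). I am guessing there must be a race condition, but as I understand it each thread writes to a unique index and the tempSum variable is local to each thread, so there shouldn't be one. I am quite sure my for loop conditions are correct. Any suggestions would be greatly appreciated. Thanks.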

Answer 1

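It is better if you provide complete code. That should be straightforward with trivial additions to what you have shown, just as I have done below. However, there are differences between your two implementations even under a restrictive set of assumptions.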

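I will assume that:

1. Your overall data set size is a whole-number multiple of the threadblock size.
2. You are launching exactly as many threads in total as the size of your data set.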

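I am also not going to comment on whether your shared implementation makes sense, i.e. whether it should be expected to perform better than the non-shared one. That does not seem to be the crux of your question, which is why you are getting a numerical difference between the two implementations.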

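The primary issue is that the method of selecting which elements to compute the pairwise "distance" for does not match between the two cases. In the non-shared implementation, for every element i in your input data set, you are computing a sum of distances between i and every element at a higher index than i: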

for i in range(tx+1,A.size):
               ^^^^^^^^^^^

This selection of items to sum does not match the shared implementation:

            for i in range(cuda.threadIdx.x,cuda.blockDim.x):
                if(copyIdx != tx):

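There are several issues here, but it should be plainly evident that, for each block of data copied in, a given thread at position threadIdx.x only adds to its sum the elements of that block whose position within the block is at or above threadIdx.x. In addition, copyIdx != tx is false for every i while the thread is processing its own block, so that block is skipped entirely. This means that as you go through the whole data set block by block, you skip elements in each block. That cannot possibly match the non-shared implementation. If this is not evident, just plug actual values into the range of the for loop. Suppose cuda.threadIdx.x is 5 and cuda.blockDim.x is 32. Then range(5, 32) makes that thread sum only items 5-31 of each block of data it processes, throughout the array, and items 0-4 of every block are never included.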
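
To make the mismatch concrete, here is a small CPU sketch that lists which array indices each loop structure visits for one thread. The helper names are hypothetical, and the sketch relies on the two assumptions stated above (the size is a multiple of the block size, and exactly one thread is launched per element).

```python
def global_indices(tx, n):
    """Indices summed by the global-memory kernel for thread tx."""
    return list(range(tx + 1, n))

def original_shared_indices(tx, n, tpb):
    """Indices summed by the question's shared-memory loop for thread tx."""
    thread_idx = tx % tpb
    visited = []
    for current_block in range(n // tpb):
        copy_idx = thread_idx + current_block * tpb
        for i in range(thread_idx, tpb):
            # copy_idx != tx does not depend on i: it is False for the whole
            # of the thread's own block and True for every other block.
            if copy_idx != tx:
                visited.append(current_block * tpb + i)
    return visited

n, tpb, tx = 128, 32, 5
print(len(global_indices(tx, n)))                # 122
print(len(original_shared_indices(tx, n, tpb)))  # 81
```

For thread 5, the global kernel sums indices 6 to 127, while the shared loop sums positions 5 to 31 of blocks 1, 2 and 3 only. For threads in later blocks, the shared loop also picks up indices below tx, which the global kernel never does.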

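The solution is to force the shared implementation to line up with the non-shared one in terms of how it selects the elements that contribute to the running sum.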
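
One way to line the two up, and the condition used in the complete code in this answer, is to loop over the whole tile and keep an element only when its global index is greater than tx. The CPU sketch below checks that this predicate selects exactly range(tx+1, n) for every thread. The function name is hypothetical, and the same two assumptions apply.

```python
def fixed_shared_indices(tx, n, tpb):
    """Indices selected by the corrected tile loop for thread tx."""
    thread_idx = tx % tpb
    visited = []
    for current_block in range(n // tpb):
        copy_idx = thread_idx + current_block * tpb
        for i in range(tpb):
            # copy_idx - thread_idx is the first global index of this tile,
            # so copy_idx - thread_idx + i is the global index of tile slot i.
            global_j = copy_idx - thread_idx + i
            if global_j > tx:
                visited.append(global_j)
    return visited

n, tpb = 128, 32
matches = all(
    fixed_shared_indices(tx, n, tpb) == list(range(tx + 1, n))
    for tx in range(n)
)
print(matches)  # True
```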

In addition, in the non-shared implementation you update the output point only once, and you do it with a direct assignment:

arrSum[tx] = tempSum

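In the shared implementation, you still update the output point only once, but you are not doing a direct assignment. I changed the following line to a plain assignment (=) to match the non-shared version: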

arrSum[tx] += tempSum
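
The difference matters whenever the output array is not zero on entry, for example when the same buffer is passed to two launches in a row. A minimal CPU sketch, with hypothetical helper names, of the two write styles:

```python
def write_assign(out, sums):
    """Mimics arrSum[tx] = tempSum: the result ignores what out held before."""
    for i, s in enumerate(sums):
        out[i] = s

def write_accumulate(out, sums):
    """Mimics arrSum[tx] += tempSum: the result adds to what out held before."""
    for i, s in enumerate(sums):
        out[i] += s

sums = [3.0, 2.0, 1.0, 0.0]

out = [0.0] * 4
write_assign(out, sums)
write_assign(out, sums)
print(out)  # [3.0, 2.0, 1.0, 0.0]

out = [0.0] * 4
write_accumulate(out, sums)
write_accumulate(out, sums)
print(out)  # [6.0, 4.0, 2.0, 0.0]
```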

Here is complete code with those issues addressed:

$ cat t49.py
from numba import cuda
import numpy as np
import math
import time
from numba import float32

sigma = np.float32(1.0)
tpb = 32

@cuda.jit
def gpuSharedSameSample(A,arrSum):
    #my block size is equal to 32
    sA = cuda.shared.array(shape=(tpb),dtype=float32)
    bpg = cuda.gridDim.x
    tx = cuda.threadIdx.x + cuda.blockIdx.x *cuda.blockDim.x
    count = len(A)
    #loop through block by block
    tempSum = 0.0
    #myPoint = A[tx]

    if(tx < count):
        myPoint = A[tx]
        for currentBlock in range(bpg):

            #load in a block to shared memory
            copyIdx = (cuda.threadIdx.x + currentBlock*cuda.blockDim.x)
            if(copyIdx < count): #this should always be true
                sA[cuda.threadIdx.x] = A[copyIdx]
            #syncthreads to ensure copying finishes first
            cuda.syncthreads()


            if((tx < count)):    #this should always be true
                for i in range(cuda.blockDim.x):
                    if(copyIdx-cuda.threadIdx.x+i > tx):
                        distance = (myPoint - sA[i])**2
                        tempSum += math.exp(-distance/sigma**2)

            #syncthreads here to avoid race conditions if a thread finishes earlier
            #arrSum[tx] += tempSum
            cuda.syncthreads()
    arrSum[tx] = tempSum

@cuda.jit
def gpuSameSample(A,arrSum):
    tx = cuda.blockDim.x*cuda.blockIdx.x + cuda.threadIdx.x
    temp = A[tx]
    tempSum = 0.0
    for i in range(tx+1,A.size):
        distance = (temp - A[i])**2
        tempSum +=  math.exp(-distance/sigma**2)
    arrSum[tx] = tempSum

size = 128
threads_per_block = tpb
blocks = (size + (threads_per_block - 1)) // threads_per_block
my_in  = np.ones( size, dtype=np.float32)
my_out = np.zeros(size, dtype=np.float32)
gpuSameSample[blocks, threads_per_block](my_in, my_out)
print(my_out)
gpuSharedSameSample[blocks, threads_per_block](my_in, my_out)
print(my_out)
$ python t49.py
[ 127.  126.  125.  124.  123.  122.  121.  120.  119.  118.  117.  116.
  115.  114.  113.  112.  111.  110.  109.  108.  107.  106.  105.  104.
  103.  102.  101.  100.   99.   98.   97.   96.   95.   94.   93.   92.
   91.   90.   89.   88.   87.   86.   85.   84.   83.   82.   81.   80.
   79.   78.   77.   76.   75.   74.   73.   72.   71.   70.   69.   68.
   67.   66.   65.   64.   63.   62.   61.   60.   59.   58.   57.   56.
   55.   54.   53.   52.   51.   50.   49.   48.   47.   46.   45.   44.
   43.   42.   41.   40.   39.   38.   37.   36.   35.   34.   33.   32.
   31.   30.   29.   28.   27.   26.   25.   24.   23.   22.   21.   20.
   19.   18.   17.   16.   15.   14.   13.   12.   11.   10.    9.    8.
    7.    6.    5.    4.    3.    2.    1.    0.]
[ 127.  126.  125.  124.  123.  122.  121.  120.  119.  118.  117.  116.
  115.  114.  113.  112.  111.  110.  109.  108.  107.  106.  105.  104.
  103.  102.  101.  100.   99.   98.   97.   96.   95.   94.   93.   92.
   91.   90.   89.   88.   87.   86.   85.   84.   83.   82.   81.   80.
   79.   78.   77.   76.   75.   74.   73.   72.   71.   70.   69.   68.
   67.   66.   65.   64.   63.   62.   61.   60.   59.   58.   57.   56.
   55.   54.   53.   52.   51.   50.   49.   48.   47.   46.   45.   44.
   43.   42.   41.   40.   39.   38.   37.   36.   35.   34.   33.   32.
   31.   30.   29.   28.   27.   26.   25.   24.   23.   22.   21.   20.
   19.   18.   17.   16.   15.   14.   13.   12.   11.   10.    9.    8.
    7.    6.    5.    4.    3.    2.    1.    0.]
$
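
The descending output follows directly from the test input. Every element is 1.0, so every pairwise distance is 0 and every term is exp(0) = 1, which means element i simply counts the size - 1 - i elements after it. A short sketch of that arithmetic (the function name is hypothetical):

```python
import math

def expected_all_ones(size, sigma=1.0):
    """Expected per-element sums when every input value is identical."""
    out = []
    for i in range(size):
        # Each of the (size - 1 - i) later elements contributes exp(-0 / sigma**2) = 1.
        out.append(sum((math.exp(-0.0 / sigma ** 2) for _ in range(i + 1, size)), 0.0))
    return out

result = expected_all_ones(128)
print(result[0], result[1], result[-1])  # 127.0 126.0 0.0
```

Note that an all-ones input checks only how many pairs each thread sums, not which ones. A non-constant input would be a stricter comparison between the two kernels.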

Note that if either of my two assumptions is violated, your code has other issues.
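
As one illustration of what would need attention when the size is not a multiple of the block size, the sketch below is a CPU model of the tiled loop with a partial last tile. It is not GPU code, and the function name and structure are assumptions for illustration. The point it shows is that the inner loop has to stop at the last valid entry of each tile, since the remaining shared-memory slots would otherwise hold stale values from the previous tile. On the GPU there is a second concern that a CPU model cannot show: a barrier such as cuda.syncthreads() placed inside a branch is only safe when every thread in the block takes that branch.

```python
import math

def tiled_same_sample(A, tpb, sigma=1.0):
    """CPU model of the tiled sum for a size that need not be a multiple of tpb."""
    n = len(A)
    bpg = (n + tpb - 1) // tpb
    out = [0.0] * n
    for tx in range(bpg * tpb):  # every launched thread, including the padded tail
        if tx >= n:
            # Out-of-range threads contribute nothing here. In a real kernel
            # they would still have to reach every barrier.
            continue
        temp_sum = 0.0
        for current_block in range(bpg):
            base = current_block * tpb
            valid = min(tpb, n - base)  # tile entries that hold real data
            for i in range(valid):
                j = base + i
                if j > tx:
                    temp_sum += math.exp(-((A[tx] - A[j]) ** 2) / sigma ** 2)
        out[tx] = temp_sum
    return out

print(tiled_same_sample([1.0] * 70, 32)[:3])  # [69.0, 68.0, 67.0]
```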

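In the future, I encourage you to provide short, complete code, as I have shown. For a question like this, it should not be much additional work. If for no other reason (and there are other reasons), it is tedious to force others to write this code from scratch, when you already have it, in order to demonstrate that the answer provided is sensible.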

Computing distances between points using shared memory
