How can I sum (reduce) an array of 100M float elements in CUDA?

Asked: 2018-10-12 05:13:32

Tags: parallel-processing cuda thrust reduction

I am new to CUDA, so please excuse me if there is a simple solution to this.

I am trying to find the sum of an array of 100M float elements. From the code below, you can see that I used a reduction kernel and also thrust. I assume the kernel stores the sum in g_odata[0]. Since all the elements in g_idata are the same, the result should be n*g_idata[1]. But as you can clearly see, both results are incorrect.

  1. What am I doing wrong? How can I achieve my goal?
  2. Every reduction kernel I have found works on integer data types, for example the strongly recommended Optimizing Parallel Reduction in CUDA. Is there any specific reason for that?

Here is my code:

    #include <iostream>
    #include <math.h>
    #include <stdlib.h>
    #include <iomanip>
    #include <thrust/reduce.h>
    #include <thrust/execution_policy.h>


    using namespace std;


    __global__ void reduce(float *g_idata, float *g_odata) {

    __shared__ float sdata[256];


    int i = blockIdx.x*blockDim.x + threadIdx.x;

    sdata[threadIdx.x] = g_idata[i];

    __syncthreads();

    for (int s=1; s < blockDim.x; s *=2)
    {
        int index = 2 * s * threadIdx.x;

        if (index < blockDim.x)
        {
            sdata[index] += sdata[index + s];
        }
        __syncthreads();
    }


    if (threadIdx.x == 0)
        atomicAdd(g_odata,sdata[0]);
    }




    int main(void){

    unsigned int n=pow(10,8);
    float *g_idata, *g_odata;

    cudaMallocManaged(&g_idata, n*sizeof(float));
    cudaMallocManaged(&g_odata, n*sizeof(float));

    int blockSize = 32;
    int numBlocks = (n + blockSize - 1) / blockSize;

    for(int i=0;i<n;i++){g_idata[i]=6.1;g_odata[i]=0;}


    reduce<<<numBlocks, blockSize>>>(g_idata, g_odata);
    cudaDeviceSynchronize();


    cout << g_odata[0] << "\t" << (float)n*g_idata[1] << "\t"<< (float)n*g_idata[1]-g_odata[0]<<endl;

    g_odata[0]=thrust::reduce(thrust::device, g_idata, g_idata+n);

    cout << g_odata[0] << "\t" << (float)n*g_idata[1] << "\t"<< (float)n*g_idata[1]-g_odata[0]<<endl;



    cudaFree(g_idata);
    cudaFree(g_odata);

    }

Results:

6.0129e+08  6.1e+08 8.7097e+06
6.09986e+08 6.1e+08 13824

I am using CUDA 10. Output of nvcc --version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

Details of my GPU from deviceQuery:

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 750"
  CUDA Driver Version / Runtime Version          10.0 / 10.0
  CUDA Capability Major/Minor version number:    5.0
  Total amount of global memory:                 1999 MBytes (2096168960 bytes)
  ( 4) Multiprocessors, (128) CUDA Cores/MP:     512 CUDA Cores
  GPU Max Clock rate:                            1110 MHz (1.11 GHz)
  Memory Clock rate:                             2505 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 2097152 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            No
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.0, CUDA Runtime Version = 10.0, NumDevs = 1
Result = PASS

Thank you.

1 Answer:

Answer 0 (score: 2):

I believe the reason you are confused about the results here is a lack of understanding of floating-point arithmetic. This whitepaper covers the topic well. As a simple concept to grasp: if I have numbers represented as float quantities, and I attempt to do this:

100000000 + 1

the result will be: 100000000 (write some code and try it yourself)

This is not unique to the GPU; CPU code will behave the same way (try it).
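A minimal host-side sketch of this absorption effect (plain C++; the behavior is the same whether compiled by nvcc or any host compiler):

    #include <iostream>

    int main() {
        float big = 100000000.0f; // 1e8 is exactly representable as a float
        float sum = big + 1.0f;   // but the gap between adjacent floats at this
                                  // magnitude is 8, so the +1 is rounded away
        std::cout << (sum == big) << std::endl; // prints 1: the add was absorbed
        return 0;
    }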

So for very large reductions, we eventually get to the point where (in the typical case) we are adding very large numbers to very small numbers, and the result is no longer accurate from a "pure math" point of view.

That is fundamentally the problem here. In your CPU code, when you decide that the correct result should be 6.1*n, that kind of multiplication is not subject to the limits of adding large numbers to small ones that I just described, so you get an "accurate" expected value from it.

One way to demonstrate or work around this is to use a double representation instead of float. This does not completely eliminate the problem, but it improves the resolution to the point where the range of numbers here can be represented much more accurately.
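For comparison, a minimal sketch showing that double, with its 53-bit significand, resolves the same addition exactly:

    #include <iostream>
    #include <iomanip>

    int main() {
        double big = 100000000.0;
        double sum = big + 1.0; // double represents integers exactly up to 2^53
        std::cout << std::setprecision(17) << sum << std::endl; // prints 100000001
        return 0;
    }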

The following code has principally that change. You can flip the typedef to compare the behavior between float and double.

There are a few other changes in the code as well: the reduction loop now guards against reading past the end of shared memory with (index + s) < blockDim.x, the load is bounds-checked with i < n, g_odata is allocated as a single element and zeroed on the host, and n is passed to the kernel. None of these are the cause of the discrepancy you see.

$ cat t18.cu
    #include <iostream>
    #include <math.h>
    #include <stdlib.h>
    #include <iomanip>
    #include <thrust/reduce.h>
    #include <thrust/execution_policy.h>

    #define BLOCK_SIZE 32
    typedef double ft;
    using namespace std;
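
    // Note: atomicAdd on double is only natively available on devices of
    // compute capability 6.0 and higher; this GTX 750 is cc 5.0, so the
    // double overload is emulated with an atomicCAS loop.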

    __device__ double my_atomicAdd(double* address, double val)
    {
      unsigned long long int* address_as_ull =
                              (unsigned long long int*)address;
      unsigned long long int old = *address_as_ull, assumed;

      do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val +
                               __longlong_as_double(assumed)));

      // Note: uses integer comparison to avoid hang in case of NaN (since NaN != NaN)
      } while (assumed != old);

      return __longlong_as_double(old);
    }
    __device__ float my_atomicAdd(float* addr, float val){
        return atomicAdd(addr, val);
    }

    __global__ void reduce(ft *g_idata, ft *g_odata, int n) {

    __shared__ ft sdata[BLOCK_SIZE];

    int i = blockIdx.x*blockDim.x + threadIdx.x;

    sdata[threadIdx.x] = (i < n)?g_idata[i]:0;

    __syncthreads();

    for (int s=1; s < blockDim.x; s *=2)
    {
        int index = 2 * s * threadIdx.x;

        if ((index + s) < blockDim.x)
        {
            sdata[index] += sdata[index + s];
        }
        __syncthreads();
    }


    if (threadIdx.x == 0)
        my_atomicAdd(g_odata,sdata[0]);
    }




    int main(void){

    unsigned int n=pow(10,8);

    ft *g_idata, *g_odata;

    cudaMallocManaged(&g_idata, n*sizeof(ft));
    cudaMallocManaged(&g_odata, sizeof(ft));
    cout << "n = " << n << endl;
    int blockSize = BLOCK_SIZE;
    int numBlocks = (n + blockSize - 1) / blockSize;
    g_odata[0] = 0;
    for(int i=0;i<n;i++){g_idata[i]=6.1;}


    reduce<<<numBlocks, blockSize>>>(g_idata, g_odata, n);
    cudaDeviceSynchronize();


    cout << g_odata[0] << "\t" << (float)n*g_idata[1] << "\t"<< (float)n*g_idata[1]-g_odata[0]<<endl;

    g_odata[0]=thrust::reduce(thrust::device, g_idata, g_idata+n);

    cout << g_odata[0] << "\t" << (float)n*g_idata[1] << "\t"<< (float)n*g_idata[1]-g_odata[0]<<endl;



    cudaFree(g_idata);
    cudaFree(g_odata);

    }
$ nvcc -o t18 t18.cu
$ cuda-memcheck ./t18
========= CUDA-MEMCHECK
n = 100000000
6.1e+08 6.1e+08 0.00527966
6.1e+08 6.1e+08 5.13792e-05
========= ERROR SUMMARY: 0 errors
$