我是CUDA的新手。因此,如果有任何简单的解决方案,请提出疑问。
我试图找到一个数组的100M浮点元素的总和。从下面的代码中,您可以看到我使用了归约内核,并且thrust.我假设内核将总和存储在g_odata[0]
中。由于g_idata
中的所有元素都相同,因此结果应为n*g_idata[1]
。但是您可以清楚地看到两个结果都不正确。
这是我的代码:
#include <iostream>
#include <math.h>
#include <stdlib.h>
#include <iomanip>
#include <thrust/reduce.h>
#include <thrust/execution_policy.h>
using namespace std;
__global__ void reduce(float *g_idata, float *g_odata) {
__shared__ float sdata[256];
int i = blockIdx.x*blockDim.x + threadIdx.x;
sdata[threadIdx.x] = g_idata[i];
__syncthreads();
for (int s=1; s < blockDim.x; s *=2)
{
int index = 2 * s * threadIdx.x;;
if (index < blockDim.x)
{
sdata[index] += sdata[index + s];
}
__syncthreads();
}
if (threadIdx.x == 0)
atomicAdd(g_odata,sdata[0]);
}
int main(void){
unsigned int n=pow(10,8);
float *g_idata, *g_odata;
cudaMallocManaged(&g_idata, n*sizeof(float));
cudaMallocManaged(&g_odata, n*sizeof(float));
int blockSize = 32;
int numBlocks = (n + blockSize - 1) / blockSize;
for(int i=0;i<n;i++){g_idata[i]=6.1;g_odata[i]=0;}
reduce<<<numBlocks, blockSize>>>(g_idata, g_odata);
cudaDeviceSynchronize();
cout << g_odata[0] << "\t" << (float)n*g_idata[1] << "\t"<< (float)n*g_idata[1]-g_odata[0]<<endl;
g_odata[0]=thrust::reduce(thrust::device, g_idata, g_idata+n);
cout << g_odata[0] << "\t" << (float)n*g_idata[1] << "\t"<< (float)n*g_idata[1]-g_odata[0]<<endl;
cudaFree(g_idata);
cudaFree(g_odata);
}
结果:
6.0129e+08 6.1e+08 8.7097e+06
6.09986e+08 6.1e+08 13824
我正在使用CUDA 10。nvcc --version
:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
我的GPU DeviceQuery
的详细信息:
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 750"
CUDA Driver Version / Runtime Version 10.0 / 10.0
CUDA Capability Major/Minor version number: 5.0
Total amount of global memory: 1999 MBytes (2096168960 bytes)
( 4) Multiprocessors, (128) CUDA Cores/MP: 512 CUDA Cores
GPU Max Clock rate: 1110 MHz (1.11 GHz)
Memory Clock rate: 2505 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 2097152 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: No
Supports Cooperative Kernel Launch: No
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.0, CUDA Runtime Version = 10.0, NumDevs = 1
Result = PASS
谢谢。
答案 0 :(得分:2)
我认为您对这里的结果感到困惑的原因是缺乏对浮点运算的理解。 This whitepaper很好地涵盖了该主题。作为一个简单的概念,如果我有表示为float
个数量的数字,而我尝试这样做:
100000000 +1
结果将是:100000000(编写一些代码,然后自己尝试)
这不是GPU独有的,CPU代码的行为方式相同(尝试)。
因此,对于非常大的缩减,我们到了(通常)将非常大的数字加到非常小的数字上的地步,并且从“纯数学”的角度来看结果并不准确。
从根本上讲,这是问题所在。在您的CPU代码中,当您确定正确的结果应为6.1 * n时,这种乘法问题不受限于我刚才所描述的将大数与小数相加的限制,因此您可以从中获得“准确”的结果那。
证明或解决此问题的方法之一是使用double
表示形式而不是float
。这并不能完全消除问题,但可以将分辨率提高到可以更好地表示此处数字范围的程度。
以下代码主要进行了更改。您可以更改typedef
以比较float
和double
之间的行为。
代码中还有其他一些更改。这些都不是您所看到的差异的原因。
$ cat t18.cu
#include <iostream>
#include <math.h>
#include <stdlib.h>
#include <iomanip>
#include <thrust/reduce.h>
#include <thrust/execution_policy.h>
#define BLOCK_SIZE 32
typedef double ft;
using namespace std;
__device__ double my_atomicAdd(double* address, double val)
{
unsigned long long int* address_as_ull =
(unsigned long long int*)address;
unsigned long long int old = *address_as_ull, assumed;
do {
assumed = old;
old = atomicCAS(address_as_ull, assumed,
__double_as_longlong(val +
__longlong_as_double(assumed)));
// Note: uses integer comparison to avoid hang in case of NaN (since NaN != NaN)
} while (assumed != old);
return __longlong_as_double(old);
}
__device__ float my_atomicAdd(float* addr, float val){
return atomicAdd(addr, val);
}
__global__ void reduce(ft *g_idata, ft *g_odata, int n) {
__shared__ ft sdata[BLOCK_SIZE];
int i = blockIdx.x*blockDim.x + threadIdx.x;
sdata[threadIdx.x] = (i < n)?g_idata[i]:0;
__syncthreads();
for (int s=1; s < blockDim.x; s *=2)
{
int index = 2 * s * threadIdx.x;;
if ((index +s) < blockDim.x)
{
sdata[index] += sdata[index + s];
}
__syncthreads();
}
if (threadIdx.x == 0)
my_atomicAdd(g_odata,sdata[0]);
}
int main(void){
unsigned int n=pow(10,8);
ft *g_idata, *g_odata;
cudaMallocManaged(&g_idata, n*sizeof(ft));
cudaMallocManaged(&g_odata, sizeof(ft));
cout << "n = " << n << endl;
int blockSize = BLOCK_SIZE;
int numBlocks = (n + blockSize - 1) / blockSize;
g_odata[0] = 0;
for(int i=0;i<n;i++){g_idata[i]=6.1;}
reduce<<<numBlocks, blockSize>>>(g_idata, g_odata, n);
cudaDeviceSynchronize();
cout << g_odata[0] << "\t" << (float)n*g_idata[1] << "\t"<< (float)n*g_idata[1]-g_odata[0]<<endl;
g_odata[0]=thrust::reduce(thrust::device, g_idata, g_idata+n);
cout << g_odata[0] << "\t" << (float)n*g_idata[1] << "\t"<< (float)n*g_idata[1]-g_odata[0]<<endl;
cudaFree(g_idata);
cudaFree(g_odata);
}
$ nvcc -o t18 t18.cu
$ cuda-memcheck ./t18
========= CUDA-MEMCHECK
n = 100000000
6.1e+08 6.1e+08 0.00527966
6.1e+08 6.1e+08 5.13792e-05
========= ERROR SUMMARY: 0 errors
$