I am reading the book "CUDA By Example" by Jason Sanders and Edward Kandrot. So far every program has run smoothly on my machine, until I got stuck on the programs of Chapter 9.
The two programs are as follows:
The first program runs on the CPU only, hist_cpu.cu:
/*
* Copyright 1993-2010 NVIDIA Corporation. All rights reserved.
*
* NVIDIA Corporation and its licensors retain all intellectual property and
* proprietary rights in and to this software and related documentation.
* Any use, reproduction, disclosure, or distribution of this software
* and related documentation without an express license agreement from
* NVIDIA Corporation is strictly prohibited.
*
* Please refer to the applicable NVIDIA end user license agreement (EULA)
* associated with this source code for terms and conditions that govern
* your use of this NVIDIA software.
*
*/
#include "../common/book.h"
#define SIZE (100*1024*1024)
int main( void ) {
    unsigned char *buffer =
                     (unsigned char*)big_random_block( SIZE );

    // capture the start time
    clock_t         start, stop;
    start = clock();

    unsigned int    histo[256];
    for (int i=0; i<256; i++)
        histo[i] = 0;

    for (int i=0; i<SIZE; i++)
        histo[buffer[i]]++;

    stop = clock();
    float   elapsedTime = (float)(stop - start) /
                          (float)CLOCKS_PER_SEC * 1000.0f;
    printf( "Time to generate: %3.1f ms\n", elapsedTime );

    long histoCount = 0;
    for (int i=0; i<256; i++) {
        histoCount += histo[i];
    }
    printf( "Histogram Sum: %ld\n", histoCount );

    free( buffer );
    return 0;
}
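(A side note on timing methodology: as far as I know, clock() measures consumed CPU time on Linux but wall-clock time on Windows, so the CPU figures on my two platforms are not strictly comparable. A minimal wall-clock sketch using std::chrono, assuming a C++11-capable host compiler; this helper is my own illustration, not the book's code:)

#include <chrono>

// Illustrative sketch: wall-clock timing of the histogram loop.
// timeCpuHistogramMs is a hypothetical helper, not from book.h.
static double timeCpuHistogramMs( const unsigned char *buffer, int size,
                                  unsigned int histo[256] ) {
    auto start = std::chrono::steady_clock::now();
    for (int i=0; i<256; i++)
        histo[i] = 0;
    for (int i=0; i<size; i++)
        histo[buffer[i]]++;
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>( stop - start ).count();
}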
The second program runs on the GPU, hist_gpu_shmem_atomics.cu:
/*
* Copyright 1993-2010 NVIDIA Corporation. All rights reserved.
*
* NVIDIA Corporation and its licensors retain all intellectual property and
* proprietary rights in and to this software and related documentation.
* Any use, reproduction, disclosure, or distribution of this software
* and related documentation without an express license agreement from
* NVIDIA Corporation is strictly prohibited.
*
* Please refer to the applicable NVIDIA end user license agreement (EULA)
* associated with this source code for terms and conditions that govern
* your use of this NVIDIA software.
*
*/
#include "../common/book.h"
#define SIZE (100*1024*1024)
__global__ void histo_kernel( unsigned char *buffer,
                              long size,
                              unsigned int *histo ) {
    // clear out the accumulation buffer called temp
    // since we are launched with 256 threads, it is easy
    // to clear that memory with one write per thread
    __shared__  unsigned int temp[256];
    temp[threadIdx.x] = 0;
    __syncthreads();

    // calculate the starting index and the offset to the next
    // block that each thread will be processing
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        atomicAdd( &temp[buffer[i]], 1 );
        i += stride;
    }

    // sync the data from the above writes to shared memory
    // then add the shared memory values to the values from
    // the other thread blocks using global memory
    // atomic adds
    // same as before, since we have 256 threads, updating the
    // global histogram is just one write per thread!
    __syncthreads();
    atomicAdd( &(histo[threadIdx.x]), temp[threadIdx.x] );
}

int main( void ) {
    unsigned char *buffer =
                     (unsigned char*)big_random_block( SIZE );

    // capture the start time
    // starting the timer here so that we include the cost of
    // all of the operations on the GPU. if the data were
    // already on the GPU and we just timed the kernel
    // the timing would drop from 74 ms to 15 ms. Very fast.
    cudaEvent_t     start, stop;
    HANDLE_ERROR( cudaEventCreate( &start ) );
    HANDLE_ERROR( cudaEventCreate( &stop ) );
    HANDLE_ERROR( cudaEventRecord( start, 0 ) );

    // allocate memory on the GPU for the file's data
    unsigned char *dev_buffer;
    unsigned int *dev_histo;
    HANDLE_ERROR( cudaMalloc( (void**)&dev_buffer, SIZE ) );
    HANDLE_ERROR( cudaMemcpy( dev_buffer, buffer, SIZE,
                              cudaMemcpyHostToDevice ) );

    HANDLE_ERROR( cudaMalloc( (void**)&dev_histo,
                              256 * sizeof( int ) ) );
    HANDLE_ERROR( cudaMemset( dev_histo, 0,
                              256 * sizeof( int ) ) );

    // kernel launch - 2x the number of mps gave best timing
    cudaDeviceProp  prop;
    HANDLE_ERROR( cudaGetDeviceProperties( &prop, 0 ) );
    int blocks = prop.multiProcessorCount;
    histo_kernel<<<blocks*2,256>>>( dev_buffer,
                                    SIZE, dev_histo );

    unsigned int    histo[256];
    HANDLE_ERROR( cudaMemcpy( histo, dev_histo,
                              256 * sizeof( int ),
                              cudaMemcpyDeviceToHost ) );

    // get stop time, and display the timing results
    HANDLE_ERROR( cudaEventRecord( stop, 0 ) );
    HANDLE_ERROR( cudaEventSynchronize( stop ) );
    float   elapsedTime;
    HANDLE_ERROR( cudaEventElapsedTime( &elapsedTime,
                                        start, stop ) );
    printf( "Time to generate: %3.1f ms\n", elapsedTime );

    long histoCount = 0;
    for (int i=0; i<256; i++) {
        histoCount += histo[i];
    }
    printf( "Histogram Sum: %ld\n", histoCount );

    // verify that we have the same counts via CPU
    for (int i=0; i<SIZE; i++)
        histo[buffer[i]]--;
    for (int i=0; i<256; i++) {
        if (histo[i] != 0)
            printf( "Failure at %d!\n", i );
    }

    HANDLE_ERROR( cudaEventDestroy( start ) );
    HANDLE_ERROR( cudaEventDestroy( stop ) );
    cudaFree( dev_histo );
    cudaFree( dev_buffer );
    free( buffer );
    return 0;
}
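(The comment in main() mentions that timing only the kernel, with the data already resident on the GPU, drops the figure from 74 ms to 15 ms. A minimal sketch of that variant, reusing HANDLE_ERROR from book.h and assuming dev_buffer and dev_histo have already been allocated and filled as above:)

// Sketch: bracket only the kernel launch with events, so that
// cudaMalloc/cudaMemcpy costs are excluded from the measurement.
cudaEvent_t kernelStart, kernelStop;
HANDLE_ERROR( cudaEventCreate( &kernelStart ) );
HANDLE_ERROR( cudaEventCreate( &kernelStop ) );

HANDLE_ERROR( cudaEventRecord( kernelStart, 0 ) );
histo_kernel<<<blocks*2,256>>>( dev_buffer, SIZE, dev_histo );
HANDLE_ERROR( cudaEventRecord( kernelStop, 0 ) );
HANDLE_ERROR( cudaEventSynchronize( kernelStop ) );

float kernelMs;
HANDLE_ERROR( cudaEventElapsedTime( &kernelMs, kernelStart, kernelStop ) );
printf( "Kernel only: %3.1f ms\n", kernelMs );

HANDLE_ERROR( cudaEventDestroy( kernelStart ) );
HANDLE_ERROR( cudaEventDestroy( kernelStop ) );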
They should run normally, and they do run correctly on some platforms, such as the GeForce GTX 285 (as described in the book) and a K2200 (the programs were tested by another user who owns one): the CPU program takes a few hundred milliseconds, while the GPU program takes a few tens of milliseconds.
But they produce unusual results on my platforms.
1) Win7_x64 + GeForce 705M
CUDA was installed using the file "353.30-notebook-win8-win7-64bit-international-whql.exe"; running "nvcc --version" shows:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2013 NVIDIA Corporation
Built on Wed_Jul_10_13:36:45_PDT_2013
Cuda compilation tools, release 5.5, V5.5.0
2) Ubuntu 10.04 (Lucid LTS) + GeForce GT 630
CUDA was installed using the file "cudatoolkit_3.2.16_linux_64_ubuntu10.04.run"; running "nvcc --version" shows:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2013 NVIDIA Corporation
Built on Wed_Jul_17_18:36:13_PDT_2013
Cuda compilation tools, release 5.5, V5.5.0
On both platforms, the CPU program and the GPU program each took several hundred milliseconds to run; the GPU version even took longer than the CPU version.
Is there any identifiable reason for this?
This is what I see in Visual Studio 2010:
CUDA C/C++ command line:
# (Approximate command-line. Settings inherited from host are not visible below.)
# (Please see the output window after a build for the full command-line)
# Driver API (NVCC Compilation Type is .cubin, .gpu, or .ptx)
set CUDAFE_FLAGS=--sdk_dir "C:\Program Files (x86)\Microsoft SDKs\Windows\v7.0A\"
"D:\NVIDIA\CUDA\CUDAToolkit\bin\nvcc.exe" --use-local-env --cl-version 2010 -ccbin "D:\Microsoft Visual Studio 10.0\VC\bin\x86_amd64" --keep-dir x64\Release -maxrregcount=0 --machine 64 --compile -cudart static -o x64\Release\%(Filename)%(Extension).obj "%(FullPath)"
# Runtime API (NVCC Compilation Type is hybrid object or .c file)
set CUDAFE_FLAGS=--sdk_dir "C:\Program Files (x86)\Microsoft SDKs\Windows\v7.0A\"
"D:\NVIDIA\CUDA\CUDAToolkit\bin\nvcc.exe" --use-local-env --cl-version 2010 -ccbin "D:\Microsoft Visual Studio 10.0\VC\bin\x86_amd64" --keep-dir x64\Release -maxrregcount=0 --machine 64 --compile -cudart static -Xcompiler "/EHsc /nologo /Zi " -o x64\Release\%(Filename)%(Extension).obj "%(FullPath)"
Linker command line:
/OUT:"E:\learn_cuda_by_example_exercises\Chapter9HistGpuShareMemory\x64\Release\Chapter9HistGpuShareMemory.exe" /NOLOGO
/LIBPATH:"D:\NVIDIA\CUDA\CUDAToolkit\lib\x64" "glut64.lib" "cudart.lib" "kernel32.lib" "user32.lib" "gdi32.lib" "winspool.lib" "comdlg32.lib" "advapi32.lib" "shell32.lib" "ole32.lib" "oleaut32.lib" "uuid.lib" "odbc32.lib" "odbccp32.lib" /MANIFEST /ManifestFile:"x64\Release\Chapter9HistGpuShareMemory.exe.intermediate.manifest" /ALLOWISOLATION
/MANIFESTUAC:"level='asInvoker' uiAccess='false'" /DEBUG /PDB:"E:\learn_cuda_by_example_exercises\Chapter9HistGpuShareMemory\x64\Release\Chapter9HistGpuShareMemory.pdb" /OPT:REF /OPT:ICF /PGD:"E:\learn_cuda_by_example_exercises\Chapter9HistGpuShareMemory\x64\Release\Chapter9HistGpuShareMemory.pgd" /LTCG /TLBID:1
/DYNAMICBASE /NXCOMPAT /MACHINE:X64 /ERRORREPORT:QUEUE
CUDA linker command line:
# (Approximate command-line. Settings inherited from host are not visible below.)
# (Please see the output window after a build for the full command-line)
"D:\NVIDIA\CUDA\CUDAToolkit\bin\nvcc.exe" -dlink -o x64\Release\Chapter9HistGpuShareMemory.device-link.obj -Xcompiler "/EHsc /nologo /Zi "
Answer (score: 2):
I think the main problem here is the hardware.
The GeForce 705M has only 48 CUDA cores. The GeForce GT 630 has 96 or 384, depending on the version. The K2200 has 640. At the top of the range is the Titan X, with 3072 CUDA cores.
Comparing the 705M with the K2200, it is not surprising that you see a performance difference.
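(To check the hardware on a given machine, a minimal device-query sketch; the cores-per-SM table is my own rough approximation for a few architectures, not an official API:)

#include <cstdio>
#include <cuda_runtime.h>

// Rough CUDA cores per SM by compute capability (illustrative only;
// consult NVIDIA's documentation for exact figures).
static int coresPerSM( int major, int minor ) {
    if (major == 1) return 8;                      // Tesla
    if (major == 2) return (minor == 0) ? 32 : 48; // Fermi
    if (major == 3) return 192;                    // Kepler
    if (major == 5) return 128;                    // Maxwell
    return 0;                                      // unknown
}

int main( void ) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties( &prop, 0 );
    int cores = prop.multiProcessorCount *
                coresPerSM( prop.major, prop.minor );
    printf( "%s: %d SMs, ~%d CUDA cores (compute %d.%d)\n",
            prop.name, prop.multiProcessorCount, cores,
            prop.major, prop.minor );
    return 0;
}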