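I am following the example at https://devblogs.nvidia.com/even-easier-introduction-cuda/.
Going from <<<1,1>>> to <<<1,128>>>, I can reproduce the speedup described there.
But going from <<<1,128>>> to multiple blocks, I cannot reproduce the corresponding speedup: the kernel time stays at about 5.3 ms, the same as with <<<1,128>>>.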
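For reference, here is a minimal sketch of the two launches being compared. It assumes the body of gpu_add matches the grid-stride add kernel from the blog post (only the signature gpu_add(int, float*, float*) is visible in the profile) and that N = 1 << 20, which is consistent with the 4 MB copied back per array in the Unified Memory statistics. The helper blocks_for is a hypothetical name introduced only for this illustration.

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

// Grid-stride add. Assumption: same body as the blog post's kernel.
__global__ void gpu_add(int n, float *x, float *y)
{
    int index  = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}

// Hypothetical helper: number of blocks needed to cover n elements (rounds up).
int blocks_for(int n, int blockSize)
{
    return (n + blockSize - 1) / blockSize;
}

int main()
{
    const int N = 1 << 20;  // assumption: 1M floats per array, as in the blog post
    float *x = nullptr, *y = nullptr;

    // Unified Memory: pages are first touched on the host by the init loop below.
    cudaMallocManaged(&x, N * sizeof(float));
    cudaMallocManaged(&y, N * sizeof(float));
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    // Case 1 (single block):  gpu_add<<<1, 128>>>(N, x, y);
    // Case 2 (many blocks):
    int blockSize = 256;
    int numBlocks = blocks_for(N, blockSize);
    gpu_add<<<numBlocks, blockSize>>>(N, x, y);
    cudaDeviceSynchronize();

    // Every element should now be 3.0f.
    float maxError = 0.0f;
    for (int i = 0; i < N; i++)
        maxError = std::fmax(maxError, std::fabs(y[i] - 3.0f));
    printf("Maximum error %g\n", maxError);

    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

Under these assumptions, the second case launches 4096 blocks of 256 threads, so each thread handles exactly one element, while the first case has 128 threads striding over all 1M elements.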
My system has a Titan Xp:
Using GPU device 0: TITAN Xp
Number of SMs: 30
Shared memory size per thread block: 48 KB
Max #threads per thread block: 1024
Max #threads per SM: 2048
Max #warps per SM: 64
I ran nvprof on both cases.
For <<<1,128>>>, I got:
==31321== NVPROF is profiling process 31321, command: ./mat_add
Maximum error 0
==31321== Profiling application: ./mat_add
==31321== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 5.2934ms 1 5.2934ms 5.2934ms 5.2934ms gpu_add(int, float*, float*)
API calls: 95.26% 177.40ms 2 88.699ms 62.520us 177.34ms cudaMallocManaged
2.84% 5.2962ms 1 5.2962ms 5.2962ms 5.2962ms cudaDeviceSynchronize
0.85% 1.5757ms 282 5.5870us 113ns 263.86us cuDeviceGetAttribute
0.66% 1.2238ms 3 407.92us 384.86us 450.69us cuDeviceTotalMem
0.27% 507.51us 2 253.75us 222.46us 285.05us cudaFree
0.08% 154.10us 3 51.367us 44.987us 62.794us cuDeviceGetName
0.03% 58.399us 1 58.399us 58.399us 58.399us cudaLaunch
0.00% 9.0700us 3 3.0230us 361ns 7.6780us cudaSetupArgument
0.00% 2.7860us 1 2.7860us 2.7860us 2.7860us cudaConfigureCall
0.00% 2.1160us 3 705ns 183ns 1.7060us cuDeviceGetCount
0.00% 1.3710us 6 228ns 149ns 446ns cuDeviceGet
==31321== Unified Memory profiling result:
Device "TITAN Xp (0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
48 170.67KB 4.0000KB 0.9961MB 8.000000MB 2.751616ms Host To Device
24 170.67KB 4.0000KB 0.9961MB 4.000000MB 1.285856ms Device To Host
12 - - - - 4.464224ms Gpu page fault groups
Total CPU Page faults: 36
For <<<(N + blockSize - 1) / blockSize, blockSize>>> with blockSize = 256, I got:
==31499== NVPROF is profiling process 31499, command: ./mat_add
Maximum error 0
==31499== Profiling application: ./mat_add
==31499== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
**GPU activities: 100.00% 5.3866ms 1 5.3866ms 5.3866ms 5.3866ms gpu_add(int, float*, float*)**
API calls: 95.31% 174.40ms 2 87.198ms 63.448us 174.33ms cudaMallocManaged
2.94% 5.3864ms 1 5.3864ms 5.3864ms 5.3864ms cudaDeviceSynchronize
0.81% 1.4781ms 282 5.2410us 98ns 261.88us cuDeviceGetAttribute
0.53% 970.90us 3 323.63us 260.92us 387.97us cuDeviceTotalMem
0.29% 532.47us 2 266.23us 228.50us 303.97us cudaFree
0.08% 141.06us 3 47.018us 38.669us 58.495us cuDeviceGetName
0.04% 65.944us 1 65.944us 65.944us 65.944us cudaLaunch
0.00% 8.9370us 3 2.9790us 366ns 7.5420us cudaSetupArgument
0.00% 3.3580us 1 3.3580us 3.3580us 3.3580us cudaConfigureCall
0.00% 1.8850us 3 628ns 166ns 1.5220us cuDeviceGetCount
0.00% 1.5330us 6 255ns 114ns 757ns cuDeviceGet
==31499== Unified Memory profiling result:
Device "TITAN Xp (0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
109 75.155KB 4.0000KB 972.00KB 8.000000MB 2.818496ms Host To Device
24 170.67KB 4.0000KB 0.9961MB 4.000000MB 1.285344ms Device To Host
15 - - - - 5.468096ms Gpu page fault groups
Total CPU Page faults: 36
What could be the reason the kernel time does not improve with multiple blocks? Thanks!