The cudaGetDeviceProperties() API call does not seem to tell us anything about global memory latency (not even a typical value, or a min/max pair).
Edit: When I say latency, I actually mean the different latencies of the various cases in which data has to be read from main device memory. So, if we take this paper, there are actually 6 numbers: {TLB L1 hit, TLB L2 hit, TLB miss} x L1 data cache turned {on, off}.
Q1: Is there a way to obtain these numbers without measuring them yourself? Even a rule-of-thumb calculation based on the SM version, the SM clock and the memory clock would do.
A second question, then:
Q2: If not, is there a utility that measures them for you?
(Although that may be off-topic for this site.)
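For context, everything the call does expose is clock-, width- and size-related; a minimal sketch of dumping those fields with the standard CUDA runtime API (note that there is no latency field anywhere in cudaDeviceProp):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);            // query device 0

    // Only clocks, bus width and sizes are reported; anything
    // latency-related has to be derived or measured separately.
    printf("SM clock         : %d kHz\n", prop.clockRate);
    printf("memory clock     : %d kHz\n", prop.memoryClockRate);
    printf("memory bus width : %d bit\n", prop.memoryBusWidth);
    printf("L2 cache size    : %d B\n",   prop.l2CacheSize);

    // Peak theoretical DRAM bandwidth (DDR, hence the factor 2) in GB/s:
    // a derived number, but it still says nothing about access latency.
    double peak_bw = 2.0 * prop.memoryClockRate * 1e3
                   * (prop.memoryBusWidth / 8) / 1e9;
    printf("peak DRAM BW     : %.1f GB/s\n", peak_bw);
    return 0;
}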
Answer 0 (score: 3)
The purpose of cudaGetDeviceProperties() is to return the relevant microarchitectural parameters, much like the cpuid tool does on x86 CPUs. As with CPUs, performance characteristics can differ between GPUs even when their microarchitectural parameters are identical, for example due to different clock frequencies or different specifications of the attached DRAM, and due to the way these interact with the various buffering and caching mechanisms inside the processor. In general, there is no single "memory latency" number that could be reported, nor is there a known way to compute the range of possible values from the known microarchitectural parameters.
On CPUs and GPUs alike, one therefore has to resort to sophisticated microbenchmarks to determine performance parameters such as DRAM latency. How to construct such a microbenchmark for each parameter of interest is too broad to cover here; several papers have been published that discuss this in detail for NVIDIA GPUs. One of the earliest relevant publications is (online draft):
Wong, Henry, et al. "Demystifying GPU microarchitecture through microbenchmarking." In Proceedings of the 2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS), pp. 235-246.
A more recent work that also covers the Kepler architecture is (online draft):
Beyond constructing one's own microbenchmarks, one has to rely on published results such as those cited above for the various implementation-specific performance parameters of a particular GPU.
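For a flavor of how such a microbenchmark is built, here is a minimal pointer-chasing sketch in the spirit of the papers cited above (the stride, the iteration count and the interpretation of the resulting cycle count are deliberately simplistic; real measurements sweep these parameters to separate cache, TLB and DRAM effects):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Dependent-load chain: each load's address comes from the previous load,
// so the time per step approximates the unoverlapped load latency.
__global__ void chase(const unsigned int *next, unsigned int start,
                      int iters, long long *cycles, unsigned int *sink)
{
    unsigned int j = start;
    long long t0 = clock64();                  // SM cycle counter (CC 2.0+)
    for (int i = 0; i < iters; ++i)
        j = next[j];                           // serialized global loads
    long long t1 = clock64();
    *cycles = (t1 - t0) / iters;               // average cycles per load
    *sink   = j;                               // keep the chain from being optimized away
}

int main()
{
    const int n = 1 << 24;                     // 16M elements = 64 MB
    const int stride = (1 << 16) + 1;          // odd stride: one cycle over the whole
                                               // array, so no address repeats within
                                               // the run (defeats L1/L2 reuse)
    unsigned int *h = (unsigned int *)malloc(n * sizeof(unsigned int));
    for (int i = 0; i < n; ++i) h[i] = (unsigned int)((i + stride) % n);

    unsigned int *d_next, *d_sink; long long *d_cyc;
    cudaMalloc(&d_next, n * sizeof(unsigned int));
    cudaMalloc(&d_sink, sizeof(unsigned int));
    cudaMalloc(&d_cyc,  sizeof(long long));
    cudaMemcpy(d_next, h, n * sizeof(unsigned int), cudaMemcpyHostToDevice);

    chase<<<1, 1>>>(d_next, 0, 10000, d_cyc, d_sink);   // one thread: pure latency

    long long cyc = 0;
    cudaMemcpy(&cyc, d_cyc, sizeof(long long), cudaMemcpyDeviceToHost);
    printf("~%lld GPU clocks per dependent global load\n", cyc);
    return 0;
}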
In many years of optimizing for GPU platforms, I have never found the need for this kind of data; in general, the performance metrics of the CUDA profiler should be sufficient for tracking down specific bottlenecks.
Answer 1 (score: -2)
A1: A detailed picture of the GPU microarchitecture latency landscape, including the data you are asking about, is given at http://ufdc.ufl.edu/UFE0043739/00001; see Fig. 4.1 [Symbolic Multi-SM Architecture] and Table 4-1, PTX instruction classes, launch and execution cycles.
This should enter the cost/benefit equation before you even start designing a GPU kernel. Amdahl's Law then helps to decide quantitatively whether the CPU-GPU-CPU pipeline can complete the task faster at all.
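As an illustration, a back-of-the-envelope version of that cost/benefit check in plain host-side code (every figure below is a hypothetical placeholder; substitute transfer latencies, bandwidth and kernel/CPU times measured on your own system):

#include <cstdio>

int main()
{
    // Hypothetical inputs for one CPU-GPU-CPU round trip:
    const double bytes    = 64e6;      // payload per direction [B]
    const double lat_h2d  = 12e-6;     // HostToDevice transfer latency [s]
    const double lat_d2h  = 11e-6;     // DeviceToHost transfer latency [s]
    const double bw       = 5.5e9;     // effective PCIe bandwidth [B/s]
    const double t_kernel = 2.0e-3;    // measured GPU kernel time [s]
    const double t_cpu    = 20.0e-3;   // measured CPU-only time [s]

    const double t_xfer = lat_h2d + lat_d2h + 2.0 * bytes / bw;
    const double t_gpu  = t_xfer + t_kernel;

    printf("CPU-only : %8.3f ms\n", 1e3 * t_cpu);
    printf("GPU path : %8.3f ms (of which %.3f ms is pure transfer)\n",
           1e3 * t_gpu, 1e3 * t_xfer);
    printf("speed-up : %.2fx (%s)\n", t_cpu / t_gpu,
           t_cpu > t_gpu ? "offload wins" : "stay on the CPU");
    return 0;
}

With these particular placeholders the transfers alone already exceed the CPU-only time, which is exactly the kind of verdict this check delivers before a single kernel line is written.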
A2:
A GPU instruction-code simulator is one way to obtain the expected number of GPU-CLKs that a given instruction costs on a particular microarchitecture (as a lower bound, at least: additional GPU kernels may be competing for the same hardware resources at run time, which stretches the end-to-end latencies observed in vivo).
The numbers matter.
Category | GPU Hardware Unit | Throughput + Execution Latency [GPU_CLKs] | PTX instructions | Note
---------|-------------------|-------------------------------------------|------------------|-----
Load_shared | LSU | 2 + 30 | ld, ldu | .ss = .shared; .vec and .type determine the size of load. Note also that we omit .cop since not cacheable in Ocelot
Load_global | LSU | 2 + 600 | ld, ldu, prefetch, prefetchu | .ss = .global; .vec and .type determine the size of load. Ocelot may not generate prefetch since no caches
Load_local | LSU | 2 + 600 | ld, ldu, prefetch, prefetchu | .ss = .local; .vec and .type determine the size of load. Ocelot may not generate prefetch since no caches
Load_const | LSU | 2 + 600 | ld, ldu | .ss = .const; .vec and .type determine the size of load
Load_param | LSU | 2 + 30 | ld, ldu | .ss = .param; .vec and .type determine the size of load
Store_shared | LSU | 2 + 30 | st | .ss = .shared; .vec and .type determine the size of store
Store_global | LSU | 2 + 600 | st | .ss = .global; .vec and .type determine the size of store
Store_local | LSU | 2 + 600 | st | .ss = .local; .vec and .type determine the size of store
Read_modify_write_shared | LSU | 2 + 600 | atom, red | .space = shared; .type determines the size
Read_modify_write_global | LSU | 2 + 600 | atom, red | .space = global; .type determines the size
Texture | LSU | 2 + 600 | tex, txq, suld, sust, sured, suq |
Integer | ALU | 2 + 24 | add, sub, add.cc, addc, sub.cc, subc, mul, mad, mul24, mad24, sad, div, rem, abs, neg, min, max, popc, clz, bfind, brev, bfe, bfi, prmt, mov | integer instructions with type = { .u16, .u32, .u64, .s16, .s32, .s64 }
Float_single | ALU | 2 + 24 | testp, copysign, add, sub, mul, fma, mad, div, abs, neg, min, max | float-single instructions with type = { .f32 }
Float_double | ALU | 1 + 48 | testp, copysign, add, sub, mul, fma, mad, div, abs, neg, min, max | float-double instructions with type = { .f64 }
Special_single | SFU | 8 + 48 | rcp, sqrt, rsqrt, sin, cos, lg2, ex2 | special-single instructions with type = { .f32 }
Special_double | SFU | 8 + 72 | rcp, sqrt, rsqrt, sin, cos, lg2, ex2 | special-double instructions with type = { .f64 }
Logical | ALU | 2 + 24 | and, or, xor, not, cnot, shl, shr |
Control | ALU | 2 + 24 | bra, call, ret, exit |
Synchronization | ALU | 2 + 24 | bar, membar, vote |
Compare & Select | ALU | 2 + 24 | set, setp, selp, slct |
Conversion | ALU | 2 + 24 | isspacep, cvta, cvt |
Miscellaneous | ALU | 2 + 24 | brkpt, pmevent, trap |
Video | ALU | 2 + 24 | vadd, vsub, vabsdiff, vmin, vmax, vshl, vshr, vmad, vset |
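To read the table: an entry such as "2 + 600" for Load_global means roughly a 2-cycle issue cost plus about 600 cycles before the loaded value can be consumed. A rough sketch of the latency-hiding arithmetic that follows from such a figure (the instructions-between-loads count is a hypothetical placeholder standing in for your own kernel's instruction mix):

#include <cstdio>

int main()
{
    // Nominal figures taken from the table above:
    const double load_latency  = 600.0;  // Load_global execution latency [GPU_CLKs]
    const double issue_cycles  = 2.0;    // throughput cost per issued instruction [GPU_CLKs]
    const double inst_per_load = 8.0;    // hypothetical independent instructions between loads

    // Each resident warp keeps the scheduler busy for roughly
    // inst_per_load * issue_cycles cycles per outstanding load;
    // the remaining stall cycles must be covered by other warps.
    const double warps_needed = load_latency / (inst_per_load * issue_cycles);
    printf("~%.0f resident warps per scheduler to hide a %.0f-cycle load\n",
           warps_needed, load_latency);
    return 0;
}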
+====================| + 11-12 [usec] XFER-LATENCY-up HostToDevice ~~~ same as Intel X48 / nForce 790i
| |||||||||||||||||| + 10-11 [usec] XFER-LATENCY-down DeviceToHost
| |||||||||||||||||| ~ 5.5 GB/sec XFER-BW-up ~~~ same as DDR2/DDR3 throughput
| |||||||||||||||||| ~ 5.2 GB/sec XFER-BW-down @8192 KB TEST-LOAD ( immune to attempts to OverClock PCIe_BUS_CLK 100-105-110-115 [MHz] ) [D:4.9.3]
| ||||||||||||||||||
| | PCIe-2.0 ( 4x) | ~ 4 GB/s over 4-Lanes ( PORT #2 )
| | PCIe-2.0 ( 8x) | ~16 GB/s over 8-Lanes
| | PCIe-2.0 (16x) | ~32 GB/s over 16-Lanes ( mode 16x )
| ||||||||||||||||||
+====================|
| PAR -- ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| <800> warps ~~ 24000 + 3200 threads ~~ 27200 threads [!!]
| smREGs________________________________________ penalty +400 ~ +800 [GPU_CLKs] latency ( maskable by 400~800 WARPs ) on <Compile-time>-designed spillover(s) to locMEM__
| +350 ~ +700 [ns] @1147 MHz FERMI ^^^^^^^^
| | ^^^^^^^^
| +5 [ns] @ 200 MHz FPGA. . . . . . Xilinx/Zync Z7020/FPGA massive-parallel streamline-computing mode ev. PicoBlazer softCPU
| | ^^^^^^^^
| ~ +20 [ns] @1147 MHz FERMI ^^^^^^^^
| SM-REGISTERs/thread: max 63 for CC-2.x -with only about +22 [GPU_CLKs] latency ( maskable by 22-WARPs ) to hide on [REGISTER DEPENDENCY] when arithmetic result is to be served from previous [INSTR] [G]:10.4, Page-46
| max 63 for CC-3.0 - about +11 [GPU_CLKs] latency ( maskable by 44-WARPs ) [B]:5.2.3, Page-73
| max 128 for CC-1.x PAR -- ||||||||~~~|
| max 255 for CC-3.5 PAR -- ||||||||||||||||||~~~~~~|
|
| smREGs___BW ANALYZE REAL USE-PATTERNs IN PTX-creation PHASE << -Xptxas -v || nvcc -maxrregcount ( w|w/o spillover(s) )
| with about 8.0 TB/s BW [C:Pg.46]
| 1.3 TB/s BW shaMEM___ 4B * 32banks * 15 SMs * half 1.4GHz = 1.3 TB/s only on FERMI
| 0.1 TB/s BW gloMEM___
| ________________________________________________________________________________________________________________________________________________________________________________________________________________________
+========| DEVICE:3 PERSISTENT gloMEM___
| _|______________________________________________________________________________________________________________________________________________________________________________________________________________________
+======| DEVICE:2 PERSISTENT gloMEM___
| _|______________________________________________________________________________________________________________________________________________________________________________________________________________________
+====| DEVICE:1 PERSISTENT gloMEM___
| _|______________________________________________________________________________________________________________________________________________________________________________________________________________________
+==| DEVICE:0 PERSISTENT gloMEM_____________________________________________________________________+440 [GPU_CLKs]_________________________________________________________________________|_GB|
! | |\ + |
o | texMEM___|_\___________________________________texMEM______________________+_______________________________________________________________________________________|_MB|
| |\ \ |\ + |\ |
| texL2cache_| \ \ .| \_ _ _ _ _ _ _ _texL2cache +370 [GPU_CLKs] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ | \ 256_KB|
| | \ \ | \ + |\ ^ \ |
| | \ \ | \ + | \ ^ \ |
| | \ \ | \ + | \ ^ \ |
| texL1cache_| \ \ .| \_ _ _ _ _ _texL1cache +260 [GPU_CLKs] _ _ _ _ _ _ _ _ _ | \_ _ _ _ _^ \ 5_KB|
| | \ \ | \ + ^\ ^ \ ^\ \ |
| shaMEM + conL3cache_| \ \ | \ _ _ _ _ conL3cache +220 [GPU_CLKs] ^ \ ^ \ ^ \ \ 32_KB|
| | \ \ | \ ^\ + ^ \ ^ \ ^ \ \ |
| | \ \ | \ ^ \ + ^ \ ^ \ ^ \ \ |
| ______________________|__________\_\_______________________|__________\_____^__\________+__________________________________________\_________\_____\________________________________|
| +220 [GPU-CLKs]_| |_ _ _ ___|\ \ \_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \ _ _ _ _\_ _ _ _+220 [GPU_CLKs] on re-use at some +50 GPU_CLKs _IF_ a FETCH from yet-in-shaL2cache
| L2-on-re-use-only +80 [GPU-CLKs]_| 64 KB L2_|_ _ _ __|\\ \ \_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \ _ _ _ _\_ _ _ + 80 [GPU_CLKs] on re-use from L1-cached (HIT) _IF_ a FETCH from yet-in-shaL1cache
| L1-on-re-use-only +40 [GPU-CLKs]_| 8 KB L1_|_ _ _ _|\\\ \_\__________________________________\________\_____+ 40 [GPU_CLKs]_____________________________________________________________________________|
| L1-on-re-use-only + 8 [GPU-CLKs]_| 2 KB L1_|__________|\\\\__________\_\__________________________________\________\____+ 8 [GPU_CLKs]_________________________________________________________conL1cache 2_KB|
| on-chip|smREG +22 [GPU-CLKs]_| |t[0_______^:~~~~~~~~~~~~~~~~\:________]
|CC- MAX |_|_|_|_|_|_|_|_|_|_|_| |t[1_______^ :________]
|2.x 63 |_|_|_|_|_|_|_|_|_|_|_| |t[2_______^ :________]
|1.x 128 |_|_|_|_|_|_|_|_|_|_|_| |t[3_______^ :________]
|3.5 255 REGISTERs|_|_|_|_|_|_|_|_| |t[4_______^ :________]
| per|_|_|_|_|_|_|_|_|_|_|_| |t[5_______^ :________]
| Thread_|_|_|_|_|_|_|_|_|_| |t[6_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| |t[7_______^ 1stHalf-WARP :________]______________
| |_|_|_|_|_|_|_|_|_|_|_| |t[ 8_______^:~~~~~~~~~~~~~~~~~:________]
| |_|_|_|_|_|_|_|_|_|_|_| |t[ 9_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| |t[ A_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| |t[ B_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| |t[ C_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| |t[ D_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| |t[ E_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| W0..|t[ F_______^____________WARP__:________]_____________
| |_|_|_|_|_|_|_|_|_|_|_| ..............
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[0_______^:~~~~~~~~~~~~~~~\:________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[1_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[2_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[3_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[4_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[5_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[6_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[7_______^ 1stHalf-WARP :________]______________
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[ 8_______^:~~~~~~~~~~~~~~~~:________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[ 9_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[ A_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[ B_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[ C_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[ D_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[ E_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| W1..............|t[ F_______^___________WARP__:________]_____________
| |_|_|_|_|_|_|_|_|_|_|_|tBlock Wn....................................................|t[ F_______^___________WARP__:________]_____________
|
| ________________ °°°°°°°°°°°°°°°°°°°°°°°°°°~~~~~~~~~~°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°
| / \ CC-2.0|||||||||||||||||||||||||| ~masked ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| / \ 1.hW ^|^|^|^|^|^|^|^|^|^|^|^|^| <wait>-s ^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|
| / \ 2.hW |^|^|^|^|^|^|^|^|^|^|^|^|^ |^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^
|_______________/ \______I|I|I|I|I|I|I|I|I|I|I|I|I|~~~~~~~~~~I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|
|~~~~~~~~~~~~~~/ SM:0.warpScheduler /~~~~~~~I~I~I~I~I~I~I~I~I~I~I~I~I~~~~~~~~~~~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I
| \ | //
| \ RR-mode //
| \ GREEDY-mode //
| \________________//
| \______________/
* FERMI
* GF100 Server/HPC-GPU PCIe-2.0-16x
* GPU_CLK 1.15 GHz [Graphics 575 MHz]
* 3072 MB GDDR5 773MHz + ECC-correction
*
* 448-CUDA-COREs [SMX]-s --> 14 SM * warpSize == 448
* 48-ROPs
* 56-TEX-Units, 400 MHz RAMDAC
CUDA API reports .self to operate an API-Driver-version [5000]
on RT-version [5000]
CUDA API reports .self to operate a limited FIFO [Host] <-|buffer| <-[Device] of a size of 1048576 [B]
CUDA API reports .self to operate a limited HEAP for[Device]-side Dynamic __global__ Memory Allocations in a size of 8388608 [B] ( 8 MB if not specified in malloc() call )
CUDA API reports .self to operate cudaCreateStreamWithPriority() QUEUEs
with <_stream_PRIO_LOW____> == 725871085
with <_stream_PRIO_HIGH___> == 0
CUDA Device:0_ has <_compute capability_> == 2.0.
CUDA Device:0_ has [ Tesla M2050] .name
CUDA Device:0_ has [ 14] .multiProcessorCount [ Number of multiprocessors on device ]
CUDA Device:0_ has [ 2817982464] .totalGlobalMem [ __global__ memory available on device in Bytes [B] ]
CUDA Device:0_ has [ 65536] .totalConstMem [ __constant__ memory available on device in Bytes [B] ]
CUDA Device:0_ has [ 1147000] .clockRate [ GPU_CLK frequency in kilohertz [kHz] ]
CUDA Device:0_ has [ 32] .warpSize [ GPU WARP size in threads ]
CUDA Device:0_ has [ 1546000] .memoryClockRate [ GPU_DDR Peak memory clock frequency in kilohertz [kHz] ]
CUDA Device:0_ has [ 384] .memoryBusWidth [ GPU_DDR Global memory bus width in bits [b] ]
CUDA Device:0_ has [ 1024] .maxThreadsPerBlock [ MAX Threads per Block ]
CUDA Device:0_ has [ 32768] .regsPerBlock [ MAX number of 32-bit Registers available per Block ]
CUDA Device:0_ has [ 1536] .maxThreadsPerMultiProcessor [ MAX resident Threads per multiprocessor ]
CUDA Device:0_ has [ 786432] .l2CacheSize
CUDA Device:0_ has [ 49152] .sharedMemPerBlock [ __shared__ memory available per Block in Bytes [B] ]
CUDA Device:0_ has [ 2] .asyncEngineCount [ a number of asynchronous engines ]
CUDA Device:0_ has [ 1] .deviceOverlap [ if Device can concurrently copy memory and execute a kernel ]
CUDA Device:0_ has [ 0] .kernelExecTimeoutEnabled [ if there is a run time limit on kernel exec-s ]
CUDA Device:0_ has [ 1] .concurrentKernels [ if Device can possibly execute multiple kernels concurrently ]
CUDA Device:0_ has [ 1] .canMapHostMemory [ if can map host memory with cudaHostAlloc / cudaHostGetDevicePointer ]
CUDA Device:0_ has [ 3] .computeMode [ enum { 0: Default | 1: Exclusive<thread> | 2: Prohibited | 3: Exclusive<Process> } ]
CUDA Device:0_ has [ 1] .ECCEnabled [ if has ECC support enabled ]
CUDA Device:0_ has [ 2147483647] .memPitch [ MAX pitch in bytes allowed by memory copies [B] ]
CUDA Device:0_ has [ 65536] .maxSurface1D [ MAX 1D surface size ]
CUDA Device:0_ has [ 32768] .maxSurfaceCubemap [ MAX Cubemap surface dimensions ]
CUDA Device:0_ has [ 65536] .maxTexture1D [ MAX 1D Texture size ]
CUDA Device:0_ has [ 0] .pciBusID [ PCI bus ID of the device ]
CUDA Device:0_ has [ 0] .integrated [ if GPU-hardware is integrated with Host-side ( ref. Page-Locked Memory XFERs ) ]
CUDA Device:0_ has [ 1] .unifiedAddressing [ if can use 64-bit process Unified Virtual Address Space in CC-2.0+ ]
CUDA Device:1_ has <_compute capability_> == 2.0.
CUDA Device:1_ has [ Tesla M2050] .name
CUDA Device:1_ has [ 14] .multiProcessorCount [ Number of multiprocessors on device ]
CUDA Device:1_ has [ 2817982464] .totalGlobalMem [ __global__ memory available on device in Bytes [B] ]
CUDA Device:1_ has [ 65536] .totalConstMem [ __constant__ memory available on device in Bytes [B] ]
CUDA Device:1_ has [ 1147000] .clockRate [ GPU_CLK frequency in kilohertz [kHz] ]
CUDA Device:1_ has [ 32] .warpSize [ GPU WARP size in threads ]
CUDA Device:1_ has [ 1546000] .memoryClockRate [ GPU_DDR Peak memory clock frequency in kilohertz [kHz] ]
CUDA Device:1_ has [ 384] .memoryBusWidth [ GPU_DDR Global memory bus width in bits [b] ]
CUDA Device:1_ has [ 1024] .maxThreadsPerBlock [ MAX Threads per Block ]
CUDA Device:1_ has [ 32768] .regsPerBlock [ MAX number of 32-bit Registers available per Block ]
CUDA Device:1_ has [ 1536] .maxThreadsPerMultiProcessor [ MAX resident Threads per multiprocessor ]
CUDA Device:1_ has [ 786432] .l2CacheSize
CUDA Device:1_ has [ 49152] .sharedMemPerBlock [ __shared__ memory available per Block in Bytes [B] ]
CUDA Device:1_ has [ 2] .asyncEngineCount [ a number of asynchronous engines ]
CUDA Device:1_ has [ 1] .deviceOverlap [ if Device can concurrently copy memory and execute a kernel ]
CUDA Device:1_ has [ 0] .kernelExecTimeoutEnabled [ if there is a run time limit on kernel exec-s ]
CUDA Device:1_ has [ 1] .concurrentKernels [ if Device can possibly execute multiple kernels concurrently ]
CUDA Device:1_ has [ 1] .canMapHostMemory [ if can map host memory with cudaHostAlloc / cudaHostGetDevicePointer ]
CUDA Device:1_ has [ 3] .computeMode [ enum { 0: Default | 1: Exclusive<thread> | 2: Prohibited | 3: Exclusive<Process> } ]
CUDA Device:1_ has [ 1] .ECCEnabled [ if has ECC support enabled ]
CUDA Device:1_ has [ 2147483647] .memPitch [ MAX pitch in bytes allowed by memory copies [B] ]
CUDA Device:1_ has [ 65536] .maxSurface1D [ MAX 1D surface size ]
CUDA Device:1_ has [ 32768] .maxSurfaceCubemap [ MAX Cubemap surface dimensions ]
CUDA Device:1_ has [ 65536] .maxTexture1D [ MAX 1D Texture size ]
CUDA Device:1_ has [ 0] .pciBusID [ PCI bus ID of the device ]
CUDA Device:1_ has [ 0] .integrated [ if GPU-hardware is integrated with Host-side ( ref. Page-Locked Memory XFERs ) ]
CUDA Device:1_ has [ 1] .unifiedAddressing [ if can use 64-bit process Unified Virtual Address Space in CC-2.0+ ]
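For reference, the XFER-LATENCY and XFER-BW figures sketched above can be re-measured on any box; a minimal sketch using pinned host memory and cudaEvent timing (the 4 B copy approximates the per-transfer latency, the 8 MB copy the sustained bandwidth; the sizes are rough choices, not calibrated values, and the DeviceToHost direction is measured analogously):

#include <cstdio>
#include <cuda_runtime.h>

// Times one HostToDevice cudaMemcpy of 'bytes' with CUDA events, returns milliseconds.
static float time_h2d(void *dst, const void *src, size_t bytes)
{
    cudaEvent_t beg, end;
    cudaEventCreate(&beg);
    cudaEventCreate(&end);
    cudaEventRecord(beg);
    cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(end);
    cudaEventSynchronize(end);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, beg, end);
    cudaEventDestroy(beg);
    cudaEventDestroy(end);
    return ms;
}

int main()
{
    const size_t small = 4, big = 8 << 20;     // 4 B ~ latency, 8 MB ~ bandwidth
    void *h, *d;
    cudaMallocHost(&h, big);                   // pinned (page-locked) host buffer
    cudaMalloc(&d, big);

    time_h2d(d, h, small);                     // warm-up
    float lat_ms = time_h2d(d, h, small);
    float bw_ms  = time_h2d(d, h, big);

    printf("H2D latency   ~ %.1f us\n",   1e3f * lat_ms);
    printf("H2D bandwidth ~ %.2f GB/s\n", (big / 1e9) / (bw_ms / 1e3f));

    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}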