How can one obtain/calculate the memory latency of a GPU without measuring it?

Asked: 2016-03-02 20:24:11

Tags: memory cuda gpgpu latency

The cudaGetDeviceProperties() API call does not appear to tell us anything about global-memory latency (not even a typical value, nor a min/max pair, etc.).

Edit: By latency I actually mean the different latencies of the various cases in which data has to be read from main device memory. So, if we take this paper, there are actually 6 numbers: {TLB L1 hit, TLB L2 hit, TLB miss} x L1 data cache turned {on, off}.

Q1: Is there a way to obtain these numbers other than measuring them yourself?
Even a rule-of-thumb calculation based on the SM version, SM clock and memory clock would do.

A second question I would ask is:
Q2: If not, is there a utility that measures them for you?
(Although that may be off-topic for this site.)

2 Answers:

Answer 0 (score: 3)

The purpose of cudaGetDeviceProperties() is to return the relevant microarchitectural parameters, much like the equivalent cpuid facility on x86 CPUs. As with CPUs, performance characteristics can differ between GPUs even when the microarchitectural parameters are identical, for example due to different clock frequencies, or different specifications of the attached DRAM, and due to the way these interact with the various buffering and caching mechanisms inside the processor. In general, there is no single "memory latency" number that could be specified, nor is there a known way to compute the likely range from the known microarchitectural parameters.

On CPUs and GPUs alike, one must therefore resort to sophisticated microbenchmarks to determine performance parameters such as DRAM latency. How to construct such a microbenchmark for each desired parameter is far too broad to cover here. Multiple papers have been published that discuss this in detail for NVIDIA GPUs. One of the earliest relevant publications is (online draft):

Wong, Henry, et al. "Demystifying GPU microarchitecture through microbenchmarking." In Proceedings: 2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS), pp. 235-246.

A more recent work that includes coverage of the Kepler architecture is (online draft):

Xinxin Mei, Xiaowen Chu. "Dissecting GPU Memory Hierarchy through Microbenchmarking." ArXiv manuscript, September 2015, pp. 1-14.

Short of constructing your own microbenchmarks, one has to rely on published results, such as the ones cited above, for the various implementation-specific performance parameters of a particular GPU.

In many years of optimizing for GPU platforms, I have never needed to know this kind of data; generally speaking, the CUDA profiler's performance metrics should be sufficient to track down specific bottlenecks.

Answer 1 (score: -2)

A1: A detailed presentation of the GPU microarchitecture latency landscape, including the data you are asking about*, is given in this http://ufdc.ufl.edu/UFE0043739/00001

Fig. 4.1 [Symbolic multi-SM architecture] and Table 4-1. PTX instruction classes, launch and execution cycles there should be taken into account when shaping the cost/benefit equation, before you start building a GPU kernel.

Amdahl's Law then helps decide quantitatively whether a CPU-GPU-CPU pipeline can complete the task any faster.

A2: A GPU instruction-code simulator is a way to obtain the expected number of GPU-CLKs a particular microarchitecture has to spend (at minimum; in vivo, more GPU kernels may be using the hardware resources concurrently, which extends the observed end-to-end latencies).

Numbers matter.

   Category                     GPU
   |                            Hardware
   |                            Unit
   |                            |            Throughput
   |                            |            |               Execution
   |                            |            |               Latency
   |                            |            |               |                  PTX instructions                                                      Note 
   |____________________________|____________|_______________|__________________|_____________________________________________________________________
   Load_shared                  LSU          2               +  30              ld, ldu                                                               Note, .ss = .shared ; .vec and .type determine the size of load. Note also that we omit .cop since no cacheable in Ocelot
   Load_global                  LSU          2               + 600              ld, ldu, prefetch, prefetchu                                          Note, .ss = .global; .vec and .type determine the size of load. Note, Ocelot may not generate prefetch since no caches
   Load_local                   LSU          2               + 600              ld, ldu, prefetch, prefetchu                                          Note, .ss = .local; .vec and .type determine the size of load. Note, Ocelot may not generate prefetch since no caches
   Load_const                   LSU          2               + 600              ld, ldu                                                               Note, .ss = .const; .vec and .type determine the size of load
   Load_param                   LSU          2               +  30              ld, ldu                                                               Note, .ss = .param; .vec and .type determine the size of load
   |                            |                              
   Store_shared                 LSU          2               +  30              st                                                                    Note, .ss = .shared; .vec and .type determine the size of store
   Store_global                 LSU          2               + 600              st                                                                    Note, .ss = .global; .vec and .type determine the size of store
   Store_local                  LSU          2               + 600              st                                                                    Note, .ss = .local; .vec and .type determine the size of store
   Read_modify_write_shared     LSU          2               + 600              atom, red                                                             Note, .space = shared; .type determine the size
   Read_modify_write_global     LSU          2               + 600              atom, red                                                             Note, .space = global; .type determine the size
   |                            |                              
   Texture                      LSU          2               + 600              tex, txq, suld, sust, sured, suq
   |                            |                              
   Integer                      ALU          2               +  24              add, sub, add.cc, addc, sub.cc, subc, mul, mad, mul24, mad24, sad, div, rem, abs, neg, min, max, popc, clz, bfind, brev, bfe, bfi, prmt, mov
   |                            |                                                                                                                     Note, these integer inst. with type = { .u16, .u32, .u64, .s16, .s32, .s64 };
   |                            |                              
   Float_single                 ALU          2               +  24              testp, copysign, add, sub, mul, fma, mad, div, abs, neg, min, max     Note, these Float-single inst. with type = { .f32 };
   Float_double                 ALU          1               +  48              testp, copysign, add, sub, mul, fma, mad, div, abs, neg, min, max     Note, these Float-double inst. with type = { .f64 };
   Special_single               SFU          8               +  48              rcp, sqrt, rsqrt, sin, cos, lg2, ex2                                  Note, these special-single with type = { .f32 };
   Special_double               SFU          8               +  72              rcp, sqrt, rsqrt, sin, cos, lg2, ex2                                  Note, these special-double with type = { .f64 };
   |                                                           
   Logical                      ALU          2               +  24              and, or, xor, not, cnot, shl, shr
   Control                      ALU          2               +  24              bra, call, ret, exit
   |                                                           
   Synchronization              ALU          2               +  24              bar, member, vote
   Compare & Select             ALU          2               +  24              set, setp, selp, slct
   |                                                           
   Conversion                   ALU          2               +  24              Isspacep, cvta, cvt
   Miscellanies                 ALU          2               +  24              brkpt, pmevent, trap
   Video                        ALU          2               +  24              vadd, vsub, vabsdiff, vmin, vmax, vshl, vshr, vmad, vset
+====================| + 11-12 [usec] XFER-LATENCY-up   HostToDevice    ~~~ same as Intel X48 / nForce 790i
|   |||||||||||||||||| + 10-11 [usec] XFER-LATENCY-down DeviceToHost
|   |||||||||||||||||| ~  5.5 GB/sec XFER-BW-up                         ~~~ same as DDR2/DDR3 throughput
|   |||||||||||||||||| ~  5.2 GB/sec XFER-BW-down @8192 KB TEST-LOAD      ( immune to attempts to OverClock PCIe_BUS_CLK 100-105-110-115 [MHz] ) [D:4.9.3]
|   ||||||||||||||||||
|   | PCIe-2.0 ( 4x) | ~ 4 GB/s over  4-Lanes ( PORT #2  )
|   | PCIe-2.0 ( 8x) | ~16 GB/s over  8-Lanes
|   | PCIe-2.0 (16x) | ~32 GB/s over 16-Lanes ( mode 16x )
|   ||||||||||||||||||
+====================|
|                                                                                                                                        PAR -- ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| <800> warps ~~ 24000 + 3200 threads ~~ 27200 threads [!!]
|                                                       smREGs________________________________________ penalty +400 ~ +800 [GPU_CLKs] latency ( maskable by 400~800 WARPs ) on <Compile-time>-designed spillover(s) to locMEM__
|                                                                                                              +350 ~ +700 [ns] @1147 MHz FERMI ^^^^^^^^
|                                                                                                                          |                    ^^^^^^^^
|                                                                                                                       +5 [ns] @ 200 MHz FPGA. . . . . . Xilinx/Zync Z7020/FPGA massive-parallel streamline-computing mode ev. PicoBlazer softCPU
|                                                                                                                          |                    ^^^^^^^^
|                                                                                                                   ~  +20 [ns] @1147 MHz FERMI ^^^^^^^^
|                                                             SM-REGISTERs/thread: max  63 for CC-2.x -with only about +22 [GPU_CLKs] latency ( maskable by 22-WARPs ) to hide on [REGISTER DEPENDENCY] when arithmetic result is to be served from previous [INSTR] [G]:10.4, Page-46
|                                                                                  max  63 for CC-3.0 -          about +11 [GPU_CLKs] latency ( maskable by 44-WARPs ) [B]:5.2.3, Page-73
|                                                                                  max 128 for CC-1.x                                    PAR -- ||||||||~~~|
|                                                                                  max 255 for CC-3.5                                    PAR -- ||||||||||||||||||~~~~~~|
|
|                                                       smREGs___BW                                 ANALYZE REAL USE-PATTERNs IN PTX-creation PHASE <<  -Xptxas -v          || nvcc -maxrregcount ( w|w/o spillover(s) )
|                                                                with about 8.0  TB/s BW            [C:Pg.46]
|                                                                           1.3  TB/s BW shaMEM___  4B * 32banks * 15 SMs * half 1.4GHz = 1.3 TB/s only on FERMI
|                                                                           0.1  TB/s BW gloMEM___
|         ________________________________________________________________________________________________________________________________________________________________________________________________________________________
+========|   DEVICE:3 PERSISTENT                          gloMEM___
|       _|______________________________________________________________________________________________________________________________________________________________________________________________________________________
+======|   DEVICE:2 PERSISTENT                          gloMEM___
|     _|______________________________________________________________________________________________________________________________________________________________________________________________________________________
+====|   DEVICE:1 PERSISTENT                          gloMEM___
|   _|______________________________________________________________________________________________________________________________________________________________________________________________________________________
+==|   DEVICE:0 PERSISTENT                          gloMEM_____________________________________________________________________+440 [GPU_CLKs]_________________________________________________________________________|_GB|
!  |                                                         |\                                                                +                                                                                           |
o  |                                                texMEM___|_\___________________________________texMEM______________________+_______________________________________________________________________________________|_MB|
   |                                                         |\ \                                 |\                           +                                               |\                                          |
   |                                              texL2cache_| \ \                               .| \_ _ _ _ _ _ _ _texL2cache +370 [GPU_CLKs] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ | \                                   256_KB|
   |                                                         |  \ \                               |  \                         +                                 |\            ^  \                                        |
   |                                                         |   \ \                              |   \                        +                                 | \           ^   \                                       |
   |                                                         |    \ \                             |    \                       +                                 |  \          ^    \                                      |
   |                                              texL1cache_|     \ \                           .|     \_ _ _ _ _ _texL1cache +260 [GPU_CLKs] _ _ _ _ _ _ _ _ _ |   \_ _ _ _ _^     \                                 5_KB|
   |                                                         |      \ \                           |      \                     +                         ^\      ^    \        ^\     \                                    |
   |                                     shaMEM + conL3cache_|       \ \                          |       \ _ _ _ _ conL3cache +220 [GPU_CLKs]           ^ \     ^     \       ^ \     \                              32_KB|
   |                                                         |        \ \                         |        \       ^\          +                         ^  \    ^      \      ^  \     \                                  |
   |                                                         |         \ \                        |         \      ^ \         +                         ^   \   ^       \     ^   \     \                                 |
   |                                   ______________________|__________\_\_______________________|__________\_____^__\________+__________________________________________\_________\_____\________________________________|
   |                  +220 [GPU-CLKs]_|           |_ _ _  ___|\          \ \_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \ _ _ _ _\_ _ _ _+220 [GPU_CLKs] on re-use at some +50 GPU_CLKs _IF_ a FETCH from yet-in-shaL2cache
   | L2-on-re-use-only +80 [GPU-CLKs]_| 64 KB  L2_|_ _ _   __|\\          \ \_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \ _ _ _ _\_ _ _ + 80 [GPU_CLKs] on re-use from L1-cached (HIT) _IF_ a FETCH from yet-in-shaL1cache
   | L1-on-re-use-only +40 [GPU-CLKs]_|  8 KB  L1_|_ _ _    _|\\\          \_\__________________________________\________\_____+ 40 [GPU_CLKs]_____________________________________________________________________________|
   | L1-on-re-use-only + 8 [GPU-CLKs]_|  2 KB  L1_|__________|\\\\__________\_\__________________________________\________\____+  8 [GPU_CLKs]_________________________________________________________conL1cache      2_KB|
   |     on-chip|smREG +22 [GPU-CLKs]_|           |t[0_______^:~~~~~~~~~~~~~~~~\:________]
   |CC-  MAX    |_|_|_|_|_|_|_|_|_|_|_|           |t[1_______^                  :________]
   |2.x   63    |_|_|_|_|_|_|_|_|_|_|_|           |t[2_______^                  :________] 
   |1.x  128    |_|_|_|_|_|_|_|_|_|_|_|           |t[3_______^                  :________]
   |3.5  255 REGISTERs|_|_|_|_|_|_|_|_|           |t[4_______^                  :________]
   |         per|_|_|_|_|_|_|_|_|_|_|_|           |t[5_______^                  :________]
   |         Thread_|_|_|_|_|_|_|_|_|_|           |t[6_______^                  :________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           |t[7_______^     1stHalf-WARP :________]______________
   |            |_|_|_|_|_|_|_|_|_|_|_|           |t[ 8_______^:~~~~~~~~~~~~~~~~~:________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           |t[ 9_______^                  :________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           |t[ A_______^                  :________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           |t[ B_______^                  :________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           |t[ C_______^                  :________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           |t[ D_______^                  :________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           |t[ E_______^                  :________]
   |            |_|_|_|_|_|_|_|_|_|_|_|       W0..|t[ F_______^____________WARP__:________]_____________
   |            |_|_|_|_|_|_|_|_|_|_|_|         ..............             
   |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[0_______^:~~~~~~~~~~~~~~~\:________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[1_______^                 :________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[2_______^                 :________] 
   |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[3_______^                 :________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[4_______^                 :________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[5_______^                 :________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[6_______^                 :________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[7_______^    1stHalf-WARP :________]______________
   |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[ 8_______^:~~~~~~~~~~~~~~~~:________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[ 9_______^                 :________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[ A_______^                 :________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[ B_______^                 :________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[ C_______^                 :________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[ D_______^                 :________]
   |            |_|_|_|_|_|_|_|_|_|_|_|           ............|t[ E_______^                 :________]
   |            |_|_|_|_|_|_|_|_|_|_|_|       W1..............|t[ F_______^___________WARP__:________]_____________
   |            |_|_|_|_|_|_|_|_|_|_|_|tBlock Wn....................................................|t[ F_______^___________WARP__:________]_____________
   |
   |                   ________________          °°°°°°°°°°°°°°°°°°°°°°°°°°~~~~~~~~~~°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°
   |                  /                \   CC-2.0|||||||||||||||||||||||||| ~masked  ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
   |                 /                  \  1.hW  ^|^|^|^|^|^|^|^|^|^|^|^|^| <wait>-s ^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|
   |                /                    \ 2.hW  |^|^|^|^|^|^|^|^|^|^|^|^|^          |^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^
   |_______________/                      \______I|I|I|I|I|I|I|I|I|I|I|I|I|~~~~~~~~~~I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|
   |~~~~~~~~~~~~~~/ SM:0.warpScheduler    /~~~~~~~I~I~I~I~I~I~I~I~I~I~I~I~I~~~~~~~~~~~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I
   |              \          |           //
   |               \         RR-mode    //
   |                \    GREEDY-mode   //
   |                 \________________//
   |                   \______________/

GPU cluster self-report:

* FERMI
* GF100 Server/HPC-GPU PCIe-2.0-16x
*                      GPU_CLK 1.15 GHz [Graphics 575 MHz]
*                      3072 MB GDDR5 773MHz + ECC-correction
* 
*                       448-CUDA-COREs [SMX]-s --> 14 SM * warpSize == 448
*                        48-ROPs
*                        56-TEX-Units, 400 MHz RAMDAC
CUDA API reports .self to operate an API-Driver-version [5000]
                                  on RT-version         [5000]

CUDA API reports .self to operate a limited FIFO [Host] <-|buffer| <-[Device] of a size of 1048576 [B]

CUDA API reports .self to operate a limited HEAP                  for[Device]-side Dynamic __global__ Memory Allocations in a size of 8388608 [B] ( 8 MB if not specified in malloc() call )

CUDA API reports .self to operate cudaCreateStreamWithPriority() QUEUEs
                                                  with <_stream_PRIO_LOW____> == 725871085
                                                  with <_stream_PRIO_HIGH___> == 0


  CUDA Device:0_ has <_compute capability_> == 2.0.
  CUDA Device:0_ has [    Tesla M2050] .name
  CUDA Device:0_ has [             14] .multiProcessorCount         [ Number of multiprocessors on device ]
  CUDA Device:0_ has [     2817982464] .totalGlobalMem              [ __global__   memory available on device in Bytes [B] ]
  CUDA Device:0_ has [          65536] .totalConstMem               [ __constant__ memory available on device in Bytes [B] ]
  CUDA Device:0_ has [        1147000] .clockRate                   [ GPU_CLK frequency in kilohertz [kHz] ]
  CUDA Device:0_ has [             32] .warpSize                    [ GPU WARP size in threads ]
  CUDA Device:0_ has [        1546000] .memoryClockRate             [ GPU_DDR Peak memory clock frequency in kilohertz [kHz] ]
  CUDA Device:0_ has [            384] .memoryBusWidth              [ GPU_DDR Global memory bus width in bits [b] ]
  CUDA Device:0_ has [           1024] .maxThreadsPerBlock          [ MAX Threads per Block ]
  CUDA Device:0_ has [          32768] .regsPerBlock                [ MAX number of 32-bit Registers available per Block ]
  CUDA Device:0_ has [           1536] .maxThreadsPerMultiProcessor [ MAX resident Threads per multiprocessor ]
  CUDA Device:0_ has [         786432] .l2CacheSize
  CUDA Device:0_ has [          49152] .sharedMemPerBlock           [ __shared__   memory available per Block in Bytes [B] ]
  CUDA Device:0_ has [              2] .asyncEngineCount            [ a number of asynchronous engines ]
  CUDA Device:0_ has [              1] .deviceOverlap               [ if Device can concurrently copy memory and execute a kernel ]
  CUDA Device:0_ has [              0] .kernelExecTimeoutEnabled    [ if there is a run time limit on kernel exec-s ]
  CUDA Device:0_ has [              1] .concurrentKernels           [ if Device can possibly execute multiple kernels concurrently ]
  CUDA Device:0_ has [              1] .canMapHostMemory            [ if can map host memory with cudaHostAlloc / cudaHostGetDevicePointer ]
  CUDA Device:0_ has [              3] .computeMode                 [ enum { 0: Default | 1: Exclusive<thread> | 2: Prohibited | 3: Exclusive<Process> } ]
  CUDA Device:0_ has [              1] .ECCEnabled                  [ if has ECC support enabled ]
  CUDA Device:0_ has [     2147483647] .memPitch                    [ MAX pitch in bytes allowed by memory copies [B] ]
  CUDA Device:0_ has [          65536] .maxSurface1D                [ MAX 1D surface size ]
  CUDA Device:0_ has [          32768] .maxSurfaceCubemap           [ MAX Cubemap surface dimensions ]
  CUDA Device:0_ has [          65536] .maxTexture1D                [ MAX 1D Texture size ]
  CUDA Device:0_ has [              0] .pciBusID                    [ PCI bus ID of the device ]
  CUDA Device:0_ has [              0] .integrated                  [ if GPU-hardware is integrated with Host-side ( ref. Page-Locked Memory XFERs ) ]
  CUDA Device:0_ has [              1] .unifiedAddressing           [ if can use 64-bit process Unified Virtual Address Space in CC-2.0+ ]

  CUDA Device:1_ has <_compute capability_> == 2.0.
  CUDA Device:1_ has [    Tesla M2050] .name
  CUDA Device:1_ has [             14] .multiProcessorCount         [ Number of multiprocessors on device ]
  CUDA Device:1_ has [     2817982464] .totalGlobalMem              [ __global__   memory available on device in Bytes [B] ]
  CUDA Device:1_ has [          65536] .totalConstMem               [ __constant__ memory available on device in Bytes [B] ]
  CUDA Device:1_ has [        1147000] .clockRate                   [ GPU_CLK frequency in kilohertz [kHz] ]
  CUDA Device:1_ has [             32] .warpSize                    [ GPU WARP size in threads ]
  CUDA Device:1_ has [        1546000] .memoryClockRate             [ GPU_DDR Peak memory clock frequency in kilohertz [kHz] ]
  CUDA Device:1_ has [            384] .memoryBusWidth              [ GPU_DDR Global memory bus width in bits [b] ]
  CUDA Device:1_ has [           1024] .maxThreadsPerBlock          [ MAX Threads per Block ]
  CUDA Device:1_ has [          32768] .regsPerBlock                [ MAX number of 32-bit Registers available per Block ]
  CUDA Device:1_ has [           1536] .maxThreadsPerMultiProcessor [ MAX resident Threads per multiprocessor ]
  CUDA Device:1_ has [         786432] .l2CacheSize
  CUDA Device:1_ has [          49152] .sharedMemPerBlock           [ __shared__   memory available per Block in Bytes [B] ]
  CUDA Device:1_ has [              2] .asyncEngineCount            [ a number of asynchronous engines ]
  CUDA Device:1_ has [              1] .deviceOverlap               [ if Device can concurrently copy memory and execute a kernel ]
  CUDA Device:1_ has [              0] .kernelExecTimeoutEnabled    [ if there is a run time limit on kernel exec-s ]
  CUDA Device:1_ has [              1] .concurrentKernels           [ if Device can possibly execute multiple kernels concurrently ]
  CUDA Device:1_ has [              1] .canMapHostMemory            [ if can map host memory with cudaHostAlloc / cudaHostGetDevicePointer ]
  CUDA Device:1_ has [              3] .computeMode                 [ enum { 0: Default | 1: Exclusive<thread> | 2: Prohibited | 3: Exclusive<Process> } ]
  CUDA Device:1_ has [              1] .ECCEnabled                  [ if has ECC support enabled ]
  CUDA Device:1_ has [     2147483647] .memPitch                    [ MAX pitch in bytes allowed by memory copies [B] ]
  CUDA Device:1_ has [          65536] .maxSurface1D                [ MAX 1D surface size ]
  CUDA Device:1_ has [          32768] .maxSurfaceCubemap           [ MAX Cubemap surface dimensions ]
  CUDA Device:1_ has [          65536] .maxTexture1D                [ MAX 1D Texture size ]
  CUDA Device:1_ has [              0] .pciBusID                    [ PCI bus ID of the device ]
  CUDA Device:1_ has [              0] .integrated                  [ if GPU-hardware is integrated with Host-side ( ref. Page-Locked Memory XFERs ) ]
  CUDA Device:1_ has [              1] .unifiedAddressing           [ if can use 64-bit process Unified Virtual Address Space in CC-2.0+ ]