Question

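My kernel reaches 100% utilization, but the kernel time is only 3% and the profiler reports no time overlap between memory copies and kernels.

The combination of high utilization and low kernel time in particular makes no sense to me.

So how should I go about optimizing the kernel further?

I have already made sure that all my memory accesses are coalesced and use pinned memory, as the profiler recommended.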

`Quadro FX 580 utilization = 100.00% (62117.00/62117.00)`

Kernel time = 3.05 % of total GPU time 
Memory copy time = 0.9 % of total GPU time
Kernel taking maximum time = Pinned (0.7% of total GPU time)
Memory copy taking maximum time = memcpyHtoD (0.5% of total GPU time)
There is no time overlap between memory copies and kernels on GPU
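
For reference, the overlap that the last profiler line reports as missing is usually obtained by splitting the work into chunks and issuing copies and kernels in separate streams. The following is only a minimal sketch under assumptions: the kernel `scale`, the helper `runChunked`, and the chunk sizes are hypothetical and not taken from the code being profiled, and whether any overlap actually occurs depends on the device supporting concurrent copy and execution. With memory copies at 0.9% of total GPU time, overlap could hide at most that 0.9%.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical element-wise kernel standing in for the real one.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

// Processes nStreams chunks of `chunk` floats, one stream per chunk, so that
// the copy of one chunk may overlap with the kernel of another.
// Returns true if every element ends up doubled.
static bool runChunked(int nStreams, int chunk)
{
    const int n = nStreams * chunk;
    const size_t chunkBytes = chunk * sizeof(float);
    float *hostBuf = NULL, *devBuf = NULL;

    // Pinned host memory is required for cudaMemcpyAsync to be asynchronous.
    if (cudaMallocHost((void **)&hostBuf, n * sizeof(float)) != cudaSuccess)
        return false;
    if (cudaMalloc((void **)&devBuf, n * sizeof(float)) != cudaSuccess) {
        cudaFreeHost(hostBuf);
        return false;
    }
    for (int i = 0; i < n; ++i)
        hostBuf[i] = 1.0f;

    cudaStream_t *streams = new cudaStream_t[nStreams];
    for (int s = 0; s < nStreams; ++s)
        cudaStreamCreate(&streams[s]);

    const int blockSize = 128;                       // assumed block size
    const int gridSize = (chunk + blockSize - 1) / blockSize;
    for (int s = 0; s < nStreams; ++s) {
        const int offset = s * chunk;
        cudaMemcpyAsync(devBuf + offset, hostBuf + offset, chunkBytes,
                        cudaMemcpyHostToDevice, streams[s]);
        scale<<<gridSize, blockSize, 0, streams[s]>>>(devBuf + offset, chunk, 2.0f);
        cudaMemcpyAsync(hostBuf + offset, devBuf + offset, chunkBytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    // Wait for all streams, then verify the result on the host.
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    for (int i = 0; ok && i < n; ++i)
        ok = (hostBuf[i] == 2.0f);

    for (int s = 0; s < nStreams; ++s)
        cudaStreamDestroy(streams[s]);
    delete[] streams;
    cudaFree(devBuf);
    cudaFreeHost(hostBuf);
    return ok;
}

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess)
        return 1;
    // deviceOverlap is nonzero if the device can copy and run kernels concurrently.
    printf("deviceOverlap = %d\n", prop.deviceOverlap);

    const bool ok = runChunked(2, 1 << 18);
    printf("result %s\n", ok ? "correct" : "incorrect");
    return ok ? 0 : 1;
}
```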

Furthermore, I have no warp serialization, no divergent branches, and no occupancy limiting factor.

Kernel details: Grid size: [4 1 1], Block size: [256 1 1]
Register Ratio: 0.9375 ( 7680 / 8192 ) [10 registers per thread]
Shared Memory Ratio: 0.09375 ( 1536 / 16384 ) [60 bytes per Block]
Active Blocks per SM: 3 (Maximum Active Blocks per SM: 8)
Active threads per SM: 768 (Maximum Active threads per SM: 768)
Potential Occupancy: 1 ( 24 / 24 )
Achieved occupancy: 0.333333 (on 4 SMs)
Occupancy limiting factor: None

P.S. I am not claiming I wrote wundercode, I just don't know how to proceed from here.

Answer 1

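It seems the kernel's grid size is too small to make full use of the SMs. Why not decrease the block size and increase the grid size? I think that would help.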
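
To make the suggestion concrete, here is a rough sketch under assumptions: the kernel `scale`, the helpers `blocksFor` and `roughOccupancy`, the block size of 128, and the problem size of 2^20 elements are all hypothetical, while the limits of 4 SMs, 768 threads per SM, and 8 blocks per SM are taken from the profiler output in the question. With 4 blocks of 256 threads on 4 SMs, each SM holds one block, i.e. 256 of 768 possible threads, which matches the reported achieved occupancy of 0.333. Note that if each thread handles one element, splitting the same 1024 threads into smaller blocks leaves 256 threads per SM; the estimate only rises once the launch contains more threads in total.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical element-wise kernel standing in for the asker's kernel.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

// Blocks needed to cover n elements with one thread per element.
static int blocksFor(int n, int blockSize)
{
    return (n + blockSize - 1) / blockSize;
}

// Rough occupancy estimate: resident threads per SM divided by the SM's
// thread capacity, assuming blocks are spread evenly over the SMs.
// Register and shared-memory limits are ignored.
static float roughOccupancy(int blocks, int blockSize, int smCount,
                            int maxThreadsPerSM, int maxBlocksPerSM)
{
    int blocksPerSM = (blocks + smCount - 1) / smCount;
    if (blocksPerSM > maxBlocksPerSM)
        blocksPerSM = maxBlocksPerSM;
    int threads = blocksPerSM * blockSize;
    if (threads > maxThreadsPerSM)
        threads = maxThreadsPerSM;
    return (float)threads / (float)maxThreadsPerSM;
}

int main()
{
    // Limits copied from the profiler output in the question.
    const int smCount = 4, maxThreadsPerSM = 768, maxBlocksPerSM = 8;

    // Same 1024 threads, two different block sizes: the estimate is identical.
    printf("grid 4,  block 256: %.3f\n",
           roughOccupancy(4, 256, smCount, maxThreadsPerSM, maxBlocksPerSM));
    printf("grid 16, block 64:  %.3f\n",
           roughOccupancy(16, 64, smCount, maxThreadsPerSM, maxBlocksPerSM));

    // Assumed larger problem so that every SM can be filled.
    const int n = 1 << 20;
    const int blockSize = 128;
    const int gridSize = blocksFor(n, blockSize);
    printf("grid %d, block %d: %.3f\n", gridSize, blockSize,
           roughOccupancy(gridSize, blockSize, smCount, maxThreadsPerSM, maxBlocksPerSM));

    float *d = NULL;
    if (cudaMalloc((void **)&d, n * sizeof(float)) != cudaSuccess)
        return 1;
    cudaMemset(d, 0, n * sizeof(float));

    scale<<<gridSize, blockSize>>>(d, n, 2.0f);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        printf("kernel failed: %s\n", cudaGetErrorString(err));

    cudaFree(d);
    return err == cudaSuccess ? 0 : 1;
}
```

A block size of 64 would cap this estimate at 8 x 64 = 512 of 768 threads per SM because of the 8-blocks-per-SM limit, which is why 128 is used in the sketch.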

Optimizing a kernel using `overlap`, `kernel time` and `utilization`
