Packing Complex Numbers (Interleave vs Split) Methods

Time: 2016-11-24 11:02:46

Tags: opencl

For GPGPU cards such as NVIDIA graphics cards, which is the best way to pack complex data, split or interleaved, and why?

1 answer:

Answer 0 (score: 1)

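Assume the hardware has 3 memory channels, each complex number is a pair of fp32 values (8 bytes), and consecutive bytes map to consecutive channels.

Interleaved mode: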

complex number: C0         C1         C2         C3          C4 
bytes:          8          8          8          8           8   
memory channel: 01201201   20120120   12012012   01201201    20120120
channel-0 usage: 14 times
channel-1 usage: 13 times
channel-2 usage: 13 times

Split mode:

real part:     r0         r1          r2         r3          r4
bytes:         4          4           4          4           4
memory channel:0120       1201        2012       0120        1201
imaginary has same pattern
channel-0 usage: 2x7 = 14 times
channel-1 usage: 2x7 = 14 times
channel-2 usage: 2x6 = 12 times 

So when 5 complex numbers are read in split mode over 3 memory channels, the load is spread less evenly and one channel is underused.
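The tallies above can be reproduced with a small script. This is only a sketch of the toy model used in this answer: it assumes byte address b is served by channel b % num_channels and that the imaginary array of the split layout starts on a channel-0 boundary. The function names are made up for illustration.

```python
def channel_usage(byte_ranges, num_channels):
    """Count bytes served by each channel for a list of (start, length) reads."""
    counts = [0] * num_channels
    for start, length in byte_ranges:
        for b in range(start, start + length):
            # Assumed mapping: consecutive bytes go to consecutive channels.
            counts[b % num_channels] += 1
    return counts


def interleaved_usage(indices, num_channels):
    # Complex number k occupies bytes 8k .. 8k+7 (fp32 real, then fp32 imaginary).
    return channel_usage([(8 * k, 8) for k in indices], num_channels)


def split_usage(indices, num_channels):
    # Real part k occupies bytes 4k .. 4k+3 of the real array.
    real = channel_usage([(4 * k, 4) for k in indices], num_channels)
    # Assumption: the imaginary array shows the same channel pattern, so double it.
    return [2 * c for c in real]


all_five = range(5)
print("interleaved:", interleaved_usage(all_five, 3))
print("split:      ", split_usage(all_five, 3))
```

Passing [0, 2, 4] instead of range(5) models even-only access, and changing the last argument models a different channel count.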

Now suppose we read only the even-indexed (or only the odd-indexed) complex numbers, as some FFT operations do:

Interleaved mode:

complex number: C0         x          C2         x           C4 
bytes:          8          x          8          x           8   
memory channel: 01201201   x          12012012   x           20120120
channel-0 usage: 8 times
channel-1 usage: 8 times
channel-2 usage: 8 times

Split mode:

real part:     r0         x           r2         x           r4
bytes:         4          x           4          x           4
memory channel:0120       x           2012       x           1201
imaginary has same pattern
channel-0 usage: 2x4 = 8 times
channel-1 usage: 2x4 = 8 times
channel-2 usage: 2x4 = 8 times 

So 3-channel hardware is not affected much.
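One way to see why 3 channels cope well: even-only access steps through memory 16 bytes at a time in interleaved mode and 8 bytes at a time in each split array, and neither stride shares a factor with 3. A minimal sketch under the same toy mapping (byte b on channel b % num_channels; the helper name is hypothetical):

```python
from math import gcd


def distinct_start_channels(stride_bytes, num_channels):
    """Number of different channels on which elements at 0, stride, 2*stride, ... begin."""
    # Start addresses modulo num_channels cycle with period num_channels / gcd.
    return num_channels // gcd(stride_bytes, num_channels)


# Even-only access on 3 channels.
print(distinct_start_channels(16, 3))  # interleaved, 16-byte stride -> 3
print(distinct_start_channels(8, 3))   # split, 8-byte stride -> 3
```

Because the starting channel rotates through all three, usage evens out over every group of three elements. When the channel count divides the stride, the same helper returns 1, meaning every element starts on the same channel.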

Now let's look at 8-channel memory access:

Interleaved mode:

complex number: C0         C1         C2         C3          C4 
bytes:          8          8          8          8           8   
memory channel: 01234567   01234567   01234567   01234567    01234567
channel-0 usage: 5 times
channel-1 usage: 5 times
channel-2 usage: 5 times
channel-3 usage: 5 times
channel-4 usage: 5 times
channel-5 usage: 5 times
channel-6 usage: 5 times
channel-7 usage: 5 times
100% bandwidth

Split mode:

real part:     r0         r1          r2         r3          r4
bytes:         4          4           4          4           4
memory channel:0123       4567        0123       4567        0123
imaginary has same pattern
channel-0 usage: 2x3 = 6 times
channel-1 usage: 2x3 = 6 times
channel-2 usage: 2x3 = 6 times
channel-3 usage: 2x3 = 6 times
channel-4 usage: 2x2 = 4 times
channel-5 usage: 2x2 = 4 times
channel-6 usage: 2x2 = 4 times
channel-7 usage: 2x2 = 4 times
Half of the channels are used 50% more often than the other half, so effective bandwidth is about 83% (the busiest channels serve 6 accesses where 5 would be ideal).

So the two layouts do not look far apart, until we return to the FFT example with even-only or odd-only access:

Interleaved mode:

complex number: C0         C1         C2         C3          C4 
bytes:          8          x          8          x           8   
memory channel: 01234567   x          01234567   x           01234567
channel-0 usage: 3 times
channel-1 usage: 3 times
channel-2 usage: 3 times
channel-3 usage: 3 times
channel-4 usage: 3 times
channel-5 usage: 3 times
channel-6 usage: 3 times
channel-7 usage: 3 times
100% bandwidth

Interleaved mode still runs at full efficiency.

Split mode:

real part:     r0         r1          r2         r3          r4
bytes:         4          x           4          x           4
memory channel:0123       x           0123       x           0123
imaginary has same pattern
channel-0 usage: 2x3 = 6 times
channel-1 usage: 2x3 = 6 times
channel-2 usage: 2x3 = 6 times
channel-3 usage: 2x3 = 6 times
channels 4-7 not used! 50% bandwidth

So in some cases, when items are accessed non-contiguously, split mode can drop to as little as 50% of the available bandwidth.
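The 100% and 50% figures can be checked numerically. The sketch below assumes the same toy mapping (byte b on channel b % 8) and takes efficiency to be the bytes actually moved divided by what all channels could deliver while the busiest one is working. Both the metric and the helper names are assumptions made for illustration.

```python
from collections import Counter

NUM_CHANNELS = 8  # channel count from the example above


def channel_counts(byte_addresses, num_channels=NUM_CHANNELS):
    # Toy mapping: byte address b is served by channel b % num_channels.
    hits = Counter(b % num_channels for b in byte_addresses)
    return [hits[c] for c in range(num_channels)]


def efficiency(counts):
    # Assumed metric: the busiest channel sets the transfer time.
    return sum(counts) / (len(counts) * max(counts))


even = [0, 2, 4]

# Interleaved: complex number k occupies bytes 8k .. 8k+7.
interleaved = [b for k in even for b in range(8 * k, 8 * k + 8)]

# Split: real part k occupies bytes 4k .. 4k+3; the imaginary array is
# assumed to start on a channel-0 boundary, so it repeats the same pattern.
real = [b for k in even for b in range(4 * k, 4 * k + 4)]
split = real + real

print(efficiency(channel_counts(interleaved)))  # 1.0
print(efficiency(channel_counts(split)))        # 0.5
```

Real memory controllers interleave channels at a coarser granularity than single bytes, so these numbers describe the model, not a particular GPU.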

You should benchmark even-only access against full access to decide which layout to use.
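As a rough outline of such a benchmark, here is a CPU-only sketch that times the same full and even-only reductions over both layouts. It only illustrates the shape of the comparison: it says nothing about GPU memory channels, and for real measurements the two functions would have to be written as OpenCL kernels and timed on the target device. The array size and function names are arbitrary choices.

```python
import timeit
from array import array

N = 1 << 16  # number of complex values (arbitrary for this sketch)

# Interleaved layout: re0, im0, re1, im1, ...
interleaved = array("f", [float(i % 7) for i in range(2 * N)])
# Split layout: the same values in two separate arrays.
re = interleaved[0::2]
im = interleaved[1::2]


def sum_interleaved(step):
    # step=1 reads every complex number, step=2 only the even-indexed ones.
    return sum(interleaved[0::2 * step]) + sum(interleaved[1::2 * step])


def sum_split(step):
    return sum(re[0::step]) + sum(im[0::step])


for step, label in ((1, "full"), (2, "even-only")):
    t_i = timeit.timeit(lambda: sum_interleaved(step), number=20)
    t_s = timeit.timeit(lambda: sum_split(step), number=20)
    print(f"{label:9s} interleaved {t_i:.4f}s  split {t_s:.4f}s")
```

Whatever the platform, the useful comparison is the ratio between the full and even-only timings for each layout.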