For GPGPU cards such as NVIDIA GPUs, which is the best way to pack complex data, split or interleaved, and why?
Answer 0 (score: 1)
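Assume the hardware has 3 memory channels, consecutive bytes are striped across the channels in turn, and the complex numbers are fp32 (8 bytes each: a 4-byte real part and a 4-byte imaginary part).
Interleaved mode: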
complex number: C0 C1 C2 C3 C4
bytes: 8 8 8 8 8
memory channel: 01201201 20120120 12012012 01201201 20120120
channel-0 usage: 14 times
channel-1 usage: 13 times
channel-2 usage: 13 times
Split mode:
real part: r0 r1 r2 r3 r4
bytes: 4 4 4 4 4
memory channel: 0120 1201 2012 0120 1201
imaginary part: same pattern (assuming its array starts on the same channel)
channel-0 usage: 2x7 = 14 times
channel-1 usage: 2x7 = 14 times
channel-2 usage: 2x6 = 12 times
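So, reading 5 complex numbers in split mode over 3 memory channels leaves one channel under-used (12 accesses versus 14 on the other two), a less even spread than in interleaved mode.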
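The counts above can be reproduced with a few lines of Python. This is only a minimal sketch of the same toy model: it assumes byte address `address % num_channels` picks the channel (a simplification of real hardware) and that both split arrays start on channel 0. The helper name `channel_hits` is made up for illustration.

```python
def channel_hits(byte_ranges, num_channels):
    """Return per-channel byte counts for a list of (start, length) reads.

    Assumption (toy model): byte address % num_channels picks the channel.
    """
    hits = [0] * num_channels
    for start, length in byte_ranges:
        for addr in range(start, start + length):
            hits[addr % num_channels] += 1
    return hits


NUM_CHANNELS = 3
N = 5  # number of complex values read

# Interleaved: element i occupies bytes [8*i, 8*i + 8) of a single array.
interleaved = channel_hits([(8 * i, 8) for i in range(N)], NUM_CHANNELS)

# Split: element i occupies bytes [4*i, 4*i + 4) of the real array.
# The imaginary array is assumed to start on channel 0 as well, so it
# contributes an identical pattern and the counts simply double.
real_only = channel_hits([(4 * i, 4) for i in range(N)], NUM_CHANNELS)
split = [2 * h for h in real_only]

print("interleaved:", interleaved)
print("split:      ", split)
```

The split result matches the 14 / 14 / 12 counts listed above. Note that the 40 interleaved bytes cannot divide evenly over 3 channels either, so neither layout is perfectly balanced here.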
Now suppose we read only the even-indexed (or only the odd-indexed) complex numbers, as some FFT steps do.
Interleaved mode:
complex number: C0 x C2 x C4
bytes: 8 x 8 x 8
memory channel: 01201201 x 12012012 x 20120120
channel-0 usage: 8 times
channel-1 usage: 8 times
channel-2 usage: 8 times
Split mode:
real part: r0 x r2 x r4
bytes: 4 x 4 x 4
memory channel: 0120 x 2012 x 1201
imaginary part: same pattern
channel-0 usage: 2x4 = 8 times
channel-1 usage: 2x4 = 8 times
channel-2 usage: 2x4 = 8 times
So 3-channel hardware is barely affected: both layouts still load every channel equally.
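The even-only case can be checked the same way. The sketch below is hypothetical helper code, not anything from a GPU toolkit; it keeps the same assumption that byte address modulo the channel count selects the channel, and that the real and imaginary arrays are aligned identically.

```python
def strided_hits(elem_bytes, indices, num_channels, arrays=1):
    """Per-channel byte counts when reading the given element indices.

    elem_bytes: size of one element in a single array (8 interleaved, 4 split)
    arrays:     number of identically aligned arrays read (2 for split mode)
    Assumption (toy model): byte address % num_channels picks the channel.
    """
    hits = [0] * num_channels
    for i in indices:
        for addr in range(i * elem_bytes, (i + 1) * elem_bytes):
            hits[addr % num_channels] += arrays
    return hits


even = range(0, 5, 2)  # C0, C2, C4

print(strided_hits(8, even, 3))            # interleaved -> [8, 8, 8]
print(strided_hits(4, even, 3, arrays=2))  # split       -> [8, 8, 8]
```

Because 3 shares no common factor with the 8-byte or 16-byte stride, the accesses keep rotating through all three channels.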
Now look at memory access with 8 channels:
Interleaved mode:
complex number: C0 C1 C2 C3 C4
bytes: 8 8 8 8 8
memory channel: 01234567 01234567 01234567 01234567 01234567
channel-0 usage: 5 times
channel-1 usage: 5 times
channel-2 usage: 5 times
channel-3 usage: 5 times
channel-4 usage: 5 times
channel-5 usage: 5 times
channel-6 usage: 5 times
channel-7 usage: 5 times
100% bandwidth
Split mode:
real part: r0 r1 r2 r3 r4
bytes: 4 4 4 4 4
memory channel: 0123 4567 0123 4567 0123
imaginary part: same pattern
channel-0 usage: 2x3 = 6 times
channel-1 usage: 2x3 = 6 times
channel-2 usage: 2x3 = 6 times
channel-3 usage: 2x3 = 6 times
channel-4 usage: 2x2 = 4 times
channel-5 usage: 2x2 = 4 times
channel-6 usage: 2x2 = 4 times
channel-7 usage: 2x2 = 4 times
Half of the channels are used 50% more often than the other half, so bandwidth drops to about 83% (5 accesses per channel if balanced, versus 6 on the busiest).
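To put a number on the imbalance, here is a small sketch under the same toy model (byte address modulo channel count, both split arrays starting on channel 0). The function names are illustrative only.

```python
def hits_contiguous(total_bytes, num_channels):
    """Per-channel byte counts for a contiguous read starting at address 0.

    Assumption (toy model): byte address % num_channels picks the channel.
    """
    hits = [0] * num_channels
    for addr in range(total_bytes):
        hits[addr % num_channels] += 1
    return hits


def imbalance(hits):
    """Ratio of the busiest channel to the least busy channel that is used."""
    used = [h for h in hits if h > 0]
    return max(used) / min(used)


N = 5  # number of complex values read
interleaved = hits_contiguous(8 * N, 8)             # one 40-byte array
split = [2 * h for h in hits_contiguous(4 * N, 8)]  # two 20-byte arrays

print(imbalance(interleaved))  # 1.0 -> every channel does equal work
print(imbalance(split))        # 1.5 -> channels 0-3 do 50% more than 4-7
```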
The gap grows when we return to the FFT example with even-only (or odd-only) access:
Interleaved mode:
complex number: C0 x C2 x C4
bytes: 8 x 8 x 8
memory channel: 01234567 x 01234567 x 01234567
channel-0 usage: 3 times
channel-1 usage: 3 times
channel-2 usage: 3 times
channel-3 usage: 3 times
channel-4 usage: 3 times
channel-5 usage: 3 times
channel-6 usage: 3 times
channel-7 usage: 3 times
100% bandwidth
Interleaved mode still runs at full speed.
Split mode:
real part: r0 x r2 x r4
bytes: 4 x 4 x 4
memory channel: 0123 x 0123 x 0123
imaginary part: same pattern
channel-0 usage: 2x3 = 6 times
channel-1 usage: 2x3 = 6 times
channel-2 usage: 2x3 = 6 times
channel-3 usage: 2x3 = 6 times
channels 4-7 are not used: 50% bandwidth
So in some cases, when elements are accessed non-contiguously, split mode can fall to 50% of the available bandwidth.
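The 100% versus 50% figures for even-only access on 8 channels follow from a simple ratio. This is a rough sketch with a made-up `efficiency` helper, assuming the same byte-modulo channel mapping and that the transfer takes as long as the busiest channel needs.

```python
def efficiency(elem_bytes, stride, count, num_channels):
    """Fraction of peak bandwidth in the toy model.

    Reads `count` elements of `elem_bytes` each, `stride` elements apart.
    Assumptions: byte address % num_channels picks the channel, and the
    transfer takes as long as the busiest channel needs.
    """
    hits = [0] * num_channels
    for k in range(count):
        start = k * stride * elem_bytes
        for addr in range(start, start + elem_bytes):
            hits[addr % num_channels] += 1
    return sum(hits) / (num_channels * max(hits))


# Even-indexed access (stride 2) to three elements on 8 channels.
print(efficiency(8, 2, 3, 8))  # interleaved      -> 1.0
print(efficiency(4, 2, 3, 8))  # split, per array -> 0.5
```

In split mode the imaginary array follows the same pattern as the real one, so the per-array figure is also the overall figure.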
You should benchmark even-only (strided) access as well as full access to decide which layout to use.