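I'm currently in my third year at university, studying for my Computer Systems and Concurrency exam, and I'm confused by a past paper question. Nobody, not even the lecturer, has answered my question.
Question: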
Consider the following GPU that consists of 8 multiprocessors clocked at 1.5 GHz, each of which contains 8 multithreaded single-precision floating-point units and integer processing units. It has a memory system that consists of 8 partitions of 1 GHz Graphics DDR3 DRAM, each 8 bytes wide and with 256 MB of capacity. Making reasonable assumptions (state them), and a naive matrix multiplication algorithm, compute how much time the computation C = A * B would take. A, B, and C are n * n matrices and n is determined by the amount of memory the system has.
Answer given in the solutions:
> Assuming it has a single-precision FP multiply-add instruction,
>
> Single-precision FP multiply-add performance = #MPs * #SP/MP * #FLOPs/instr/SP * #instr/clock * #clocks/sec = 8 * 8 * 2 * 1 * 1.5 G = 192 GFlops / second
>
> Total DDR3 RAM memory size = 8 * 256 MB = 2048 MB
>
> The peak DDR3 bandwidth = #Partitions * #bytes/transfer * #transfers/clock * #clocks/sec = 8 * 8 * 2 * 1G = 128 GB/sec
>
> Modern computers have 32-bit single precision. So, if we want 3 n*n SP matrices, maximum n is
>
> 3n^2 * 4 <= 2048 * 1024 * 1024
>
> nmax = 13377 = n
>
> The number of operations that a naive mm algorithm (triply nested loop) needs is calculated as follows:
>
> For each element of the result, we need n multiply-adds.
>
> For each row of the result, we need n * n multiply-adds.
>
> For the entire result matrix, we need n * n * n multiply-adds.
>
> Thus, approximately 2393 GFlops.
>
> Assuming no cache, we have loading of 2 matrices and storing of 1 to the graphics memory. That is 3 * n^2 = 512 GB of data. This process will take 512 / 128 = 4 seconds.
>
> Also, the processing will take 2393 / 192 = 12.46 seconds. Thus the entire matrix multiplication will take 16.46 seconds.

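
To double-check the arithmetic in the quoted solution, here is a small Python sketch that recomputes its figures from the numbers in the problem statement. The variable and function names are my own, and the assumptions (2 FLOPs per multiply-add, 2 transfers per clock, 4 bytes per value, and the solution's claimed 512 GB of traffic) are copied from the solution rather than verified independently.

```python
import math

# Hardware figures from the problem statement.
MULTIPROCESSORS = 8
SP_UNITS_PER_MP = 8
CORE_CLOCK_HZ = 1.5e9
PARTITIONS = 8
BYTES_PER_TRANSFER = 8
MEM_CLOCK_HZ = 1e9
BYTES_PER_PARTITION = 256 * 1024**2

# Assumptions taken from the quoted solution.
FLOPS_PER_INSTR = 2        # one multiply-add counted as 2 FLOPs
TRANSFERS_PER_CLOCK = 2    # double data rate
BYTES_PER_VALUE = 4        # 32-bit single precision
CLAIMED_TRAFFIC_GB = 512   # the figure the solution states, taken as given


def solution_figures():
    """Recompute the quantities that appear in the quoted solution."""
    peak_flops = MULTIPROCESSORS * SP_UNITS_PER_MP * FLOPS_PER_INSTR * CORE_CLOCK_HZ
    peak_bandwidth = PARTITIONS * BYTES_PER_TRANSFER * TRANSFERS_PER_CLOCK * MEM_CLOCK_HZ
    total_memory = PARTITIONS * BYTES_PER_PARTITION

    # Largest n with 3 * n^2 * 4 bytes <= total memory.
    n_max = math.isqrt(total_memory // (3 * BYTES_PER_VALUE))

    # Naive triply nested loop: n^3 multiply-adds.
    multiply_adds = n_max ** 3

    # The solution divides the multiply-add count directly by the peak rate.
    compute_seconds = multiply_adds / peak_flops
    transfer_seconds = CLAIMED_TRAFFIC_GB * 1e9 / peak_bandwidth

    return {
        "peak_flops": peak_flops,
        "peak_bandwidth": peak_bandwidth,
        "total_memory": total_memory,
        "n_max": n_max,
        "multiply_adds": multiply_adds,
        "compute_seconds": compute_seconds,
        "transfer_seconds": transfer_seconds,
    }


if __name__ == "__main__":
    for name, value in solution_figures().items():
        print(f"{name}: {value:,}")
```

With these assumptions the peak rates, `n_max`, and the compute time line up with the solution; only the 512 GB figure is fed in as a constant, because it is the step I cannot reproduce.
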
Now my issue is: how does the calculation 3 * (13377)^2 = 536,832,387 translate to 536,832,387 = 512 GB?
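That is 536.8 million values. Each value is 4 bytes long. The memory interface is 8 bytes wide, so assuming the GPU can't fetch 2 values and split them, this effectively doubles the size of reads and writes. The 2 GB of memory used is therefore effectively read/written twice (8 bytes are read and 4 are ignored), so only 4 GB of data is passed between the RAM and the GPU.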
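
My own estimate can be written out as a short Python sketch. It assumes, as the solution does, that each matrix element crosses the memory interface exactly once (3 * n^2 accesses), and it compares the case where each access moves 4 bytes with the case where each access occupies a full 8-byte transfer. The function name `traffic_bytes` is just something I made up for illustration.

```python
N = 13377               # n_max from the quoted solution
BYTES_PER_VALUE = 4     # single-precision float
INTERFACE_WIDTH = 8     # bytes per memory transfer, from the problem statement
GIB = 1024**3


def traffic_bytes(n, bytes_per_access):
    """Bytes moved if each of the 3 * n^2 matrix elements is accessed once."""
    return 3 * n * n * bytes_per_access


values = 3 * N * N                              # number of matrix elements touched
packed = traffic_bytes(N, BYTES_PER_VALUE)      # 4 bytes moved per value
padded = traffic_bytes(N, INTERFACE_WIDTH)      # a full 8-byte transfer per value

if __name__ == "__main__":
    print(f"values touched:         {values:,}")
    print(f"4 bytes per value:      {packed / GIB:.2f} GiB")
    print(f"8 bytes per value:      {padded / GIB:.2f} GiB")
```

Under these assumptions the traffic comes out at roughly 2 GiB, or roughly 4 GiB if half of every transfer is wasted, which is nowhere near 512 GB.
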
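Can someone please tell me where I'm going wrong? The only explanation I can think of is that the 536.8 million result is a count of memory operations in KB, which is not stated anywhere.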
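
To show why I suspect a KB unit, here is a minimal Python check of what 536,832,387 comes to under different unit interpretations. The helper `to_gib` is hypothetical, and I am assuming binary units (1 KB = 1024 bytes, 1 GB = 1024^3 bytes), which the solution does not state.

```python
VALUES = 3 * 13377**2   # 536,832,387, the figure from the solution
GIB = 1024**3


def to_gib(count, unit_bytes):
    """Convert `count` units of `unit_bytes` bytes each into GiB."""
    return count * unit_bytes / GIB


as_bytes = to_gib(VALUES, 1)       # if the figure were plain bytes
as_floats = to_gib(VALUES, 4)      # if each unit were a 4-byte float
as_kib = to_gib(VALUES, 1024)      # if each unit were 1 KB

if __name__ == "__main__":
    print(f"as bytes:         {as_bytes:.2f} GiB")
    print(f"as 4-byte values: {as_floats:.2f} GiB")
    print(f"as KB:            {as_kib:.2f} GiB")
```

Only the KB interpretation lands near 512, which is why it is the one reading I can come up with that matches the solution's number.
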