Concurrency matrix summation - past exam paper

Time: 2014-05-21 12:27:39

Tags: concurrency parallel-processing floating-point gpu computer-architecture

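I am currently in my third year at university, studying for my computer systems and concurrency exam, and I am confused by a past paper question. Nobody, not even the lecturer, has answered my question.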

Question:

  

Consider the following GPU that consists of 8 multiprocessors clocked at 1.5 GHz, each of which contains 8 multithreaded single-precision floating-point units and integer processing units. It has a memory system that consists of 8 partitions of 1 GHz Graphics DDR3 DRAM, each 8 bytes wide and with 256 MB of capacity. Making reasonable assumptions (state them), and a naive matrix multiplication algorithm, compute how much time the computation C = A * B would take. A, B, and C are n * n matrices and n is determined by the amount of memory the system has.

Answer given in the solutions:

> Assuming it has a single-precision FP multiply-add instruction,
> single-precision FP multiply-add performance
> = #MPs * #SP/MP * #FLOPs/instr/SP * #instr/clock * #clocks/sec
> = 8 * 8 * 2 * 1 * 1.5 G = 192 GFlops / second
>
> Total DDR3 RAM memory size = 8 * 256 MB = 2048 MB
>
> The peak DDR3 bandwidth = #Partitions * #bytes/transfer * #transfers/clock * #clocks/sec = 8 * 8 * 2 * 1G = 128 GB/sec
>
> Modern computers have 32-bit single precision. So, if we want 3 n*n SP matrices, maximum n is
> 3n^2 * 4 <= 2048 * 1024 * 1024
>
> nmax = 13377 = n
>
> The number of operations that a naive mm algorithm (triply nested loop) needs is calculated as follows:
>
> For each element of the result, we need n multiply-adds.
> For each row of the result, we need n * n multiply-adds.
> For the entire result matrix, we need n * n * n multiply-adds.
> Thus, approximately 2393 GFlops.
>
> Assuming no cache, we have loading of 2 matrices and storing of 1 to the graphics memory.
>
> That is 3 * n^2 = 512 GB of data. This process will take 512 / 128 = 4 seconds.
> Also, the processing will take 2393 / 192 = 12.46 seconds. Thus the entire matrix multiplication will take 16.46 seconds.
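To make the quoted arithmetic easier to check, here is a minimal Python sketch that recomputes the solution's compute-side figures from the numbers in the question. The constant and function names are illustrative, not from the exam. It assumes 1 G means 10^9, as the solution appears to, and that a multiply-add counts as 2 FLOPs. The small triple loop only confirms the n^3 multiply-add count on a toy size.

```python
import math

# Figures from the exam question and the quoted solution.
NUM_MPS = 8                # multiprocessors
SP_PER_MP = 8              # single-precision units per multiprocessor
FLOPS_PER_INSTR = 2        # one multiply-add counted as 2 FLOPs (solution's assumption)
CORE_CLOCK_HZ = 1.5e9
NUM_PARTITIONS = 8
BYTES_PER_TRANSFER = 8     # each partition is 8 bytes wide
TRANSFERS_PER_CLOCK = 2    # double data rate
MEM_CLOCK_HZ = 1e9
PARTITION_BYTES = 256 * 1024 * 1024
BYTES_PER_FLOAT = 4        # 32-bit single precision


def peak_flops():
    """Peak single-precision rate in FLOPs per second."""
    return NUM_MPS * SP_PER_MP * FLOPS_PER_INSTR * CORE_CLOCK_HZ


def peak_bandwidth():
    """Peak memory bandwidth in bytes per second."""
    return NUM_PARTITIONS * BYTES_PER_TRANSFER * TRANSFERS_PER_CLOCK * MEM_CLOCK_HZ


def max_n():
    """Largest n such that three n*n float matrices fit in memory."""
    total_bytes = NUM_PARTITIONS * PARTITION_BYTES
    return math.isqrt(total_bytes // (3 * BYTES_PER_FLOAT))


def count_multiply_adds(n):
    """Count multiply-adds in a naive triply nested loop (toy sizes only)."""
    count = 0
    for _row in range(n):
        for _col in range(n):
            for _k in range(n):
                count += 1  # one multiply-add per innermost iteration
    return count


def compute_seconds(n):
    # Follows the quoted solution, which divides the n^3 count
    # directly by the peak FLOP rate.
    return n ** 3 / peak_flops()


if __name__ == "__main__":
    n = max_n()
    print("peak GFLOPs/s:", peak_flops() / 1e9)
    print("peak GB/s:", peak_bandwidth() / 1e9)
    print("n max:", n)
    print("n^3 in units of 10^9:", n ** 3 / 1e9)
    print("compute seconds:", compute_seconds(n))
```

The compute-side numbers (192 GFlops/s, 128 GB/s, n = 13377, roughly 2393 * 10^9 multiply-adds) follow from this; the 512 GB figure is the one that does not.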

Now my question is: how does the calculation 3 * (13377^2) = 536,832,387 translate to 512 GB?

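That is 536.8 million values. Each value is 4 bytes long. The memory interface is 8 bytes wide, so assuming the GPU cannot fetch 2 values at once and split them, this effectively doubles the size of every read and write. The 2 GB of memory in use is therefore effectively read or written twice (8 bytes are transferred and 4 are ignored), so only 4 GB of data passes between RAM and the GPU.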
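The reasoning above can be written out as a short Python sketch. It assumes, as the paragraph does, that each matrix is touched exactly once and that every 4-byte value costs one full 8-byte transfer. The names are illustrative, and the 128 GB/s bandwidth is taken from the quoted solution.

```python
N = 13377                # matrix dimension from the solution
BYTES_PER_FLOAT = 4      # single-precision value
INTERFACE_BYTES = 8      # width of one memory transfer
PEAK_BANDWIDTH = 128e9   # bytes per second, from the quoted solution


def matrix_values(n):
    """Total number of values in the three n*n matrices A, B and C."""
    return 3 * n * n


def payload_bytes(n):
    """Bytes of useful data if every value is 4 bytes."""
    return matrix_values(n) * BYTES_PER_FLOAT


def transferred_bytes(n):
    """Bytes moved if each value occupies a whole 8-byte transfer (assumption)."""
    return matrix_values(n) * INTERFACE_BYTES


def transfer_seconds(n):
    """Time to move that traffic at peak bandwidth."""
    return transferred_bytes(n) / PEAK_BANDWIDTH


if __name__ == "__main__":
    print("values:", matrix_values(N))
    print("payload GiB:", payload_bytes(N) / 2**30)
    print("transferred GiB:", transferred_bytes(N) / 2**30)
    print("transfer seconds:", transfer_seconds(N))
```

Under these assumptions the traffic is about 4 GB and takes a few hundredths of a second, which is nowhere near the 512 GB and 4 seconds in the solution.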

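Can someone tell me where I am going wrong? The only way I can make the numbers work is if the 536.8 million result is a count of memory traffic in KB, which is not stated anywhere.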
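A quick check of that guess, as a hedged sketch: if 536,832,387 is read as a number of kilobytes (taking 1 KB = 1024 bytes and 1 GB = 1024^3 bytes, which is an assumption), the conversion does land very close to 512 GB. This only shows that the numbers coincide, not that it is what the solution intended. The function name is illustrative.

```python
VALUES = 3 * 13377 ** 2  # 536,832,387, the figure from the question


def kib_to_gib(kib):
    """Convert a quantity in KiB (1024 bytes) to GiB (1024**3 bytes)."""
    return kib / (1024 * 1024)


if __name__ == "__main__":
    print("interpreted as KiB, in GiB:", kib_to_gib(VALUES))
```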

0 answers:

No answers