Question

I'm sure this has been answered before, but I can't find a good explanation.

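I'm writing a graphics program in which part of the pipeline copies voxel data into OpenCL page-locked (pinned) memory. I found that this copy is a bottleneck, so I measured the performance of a plain std::copy. The data is floats, and each chunk I want to copy is about 64 MB.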

This is my original code, before any benchmarking attempts:

std::copy(data, data+numVoxels, pinnedPointer_[_index]);

Here data is a float pointer, numVoxels is an unsigned int, and pinnedPointer_[_index] is a float pointer referencing a pinned OpenCL buffer.

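Since performance was poor, I decided to try copying smaller parts of the data to see what bandwidth I got. I used boost::timer::cpu_timer for timing. I tried both timing a single run and averaging over a few hundred runs, with similar results. Here is the relevant code along with the results: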

boost::timer::cpu_timer t;                                                    
unsigned int testNum = numVoxels;                                             
while (testNum > 2) {                                                         
  t.start();                                                                  
  std::copy(data, data+testNum, pinnedPointer_[_index]);                      
  t.stop();                                                                   
  boost::timer::cpu_times result = t.elapsed();                               
  double time = (double)result.wall / 1.0e9 ;                                 
  int size = testNum*sizeof(float);                                           
  double GB = (double)size / 1073741824.0;  // bytes per GiB (2^30)
  // Print results  
  testNum /= 2;                                                               
}

Copied 67108864 bytes in 0.032683s, 1.912315 GB/s
Copied 33554432 bytes in 0.017193s, 1.817568 GB/s
Copied 16777216 bytes in 0.008586s, 1.819749 GB/s
Copied 8388608 bytes in 0.004227s, 1.848218 GB/s
Copied 4194304 bytes in 0.001886s, 2.071705 GB/s
Copied 2097152 bytes in 0.000819s, 2.383543 GB/s
Copied 1048576 bytes in 0.000290s, 3.366923 GB/s
Copied 524288 bytes in 0.000063s, 7.776913 GB/s
Copied 262144 bytes in 0.000016s, 15.741867 GB/s
Copied 131072 bytes in 0.000008s, 15.213149 GB/s
Copied 65536 bytes in 0.000004s, 14.374742 GB/s
Copied 32768 bytes in 0.000003s, 10.209962 GB/s
Copied 16384 bytes in 0.000001s, 10.344942 GB/s
Copied 8192 bytes in 0.000001s, 6.476566 GB/s
Copied 4096 bytes in 0.000001s, 4.999603 GB/s
Copied 2048 bytes in 0.000001s, 1.592111 GB/s
Copied 1024 bytes in 0.000001s, 1.600125 GB/s
Copied 512 bytes in 0.000001s, 0.843960 GB/s
Copied 256 bytes in 0.000001s, 0.210990 GB/s
Copied 128 bytes in 0.000001s, 0.098439 GB/s
Copied 64 bytes in 0.000001s, 0.049795 GB/s
Copied 32 bytes in 0.000001s, 0.049837 GB/s
Copied 16 bytes in 0.000001s, 0.023728 GB/s

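There is a clear bandwidth peak for chunks of 65536 to 262144 bytes, where the bandwidth is far higher than when copying the whole array (about 15 vs. 2 GB/s).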
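For anyone who wants to reproduce this kind of measurement without Boost, here is a minimal sketch. It swaps in std::chrono::steady_clock for boost::timer::cpu_timer, and the function name measureCopyBandwidth is made up for illustration; it is not part of the original program.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>

// Hypothetical helper: copies `count` floats from src to dst once and
// returns the observed bandwidth in GiB/s (1 GiB = 1073741824 bytes).
double measureCopyBandwidth(const float* src, float* dst, std::size_t count) {
    using clock = std::chrono::steady_clock;

    const auto start = clock::now();
    std::copy(src, src + count, dst);
    const auto stop = clock::now();

    // Elapsed wall-clock time in seconds as a double.
    const double seconds = std::chrono::duration<double>(stop - start).count();
    // size_t arithmetic avoids overflowing an int for large buffers.
    const double gib = static_cast<double>(count * sizeof(float)) / 1073741824.0;
    return gib / seconds;
}
```

Calling it as measureCopyBandwidth(data, pinnedPointer_[_index], numVoxels) would correspond to the first row of the table above. For very small copies the elapsed time approaches the clock resolution, so the last rows of such a table are dominated by timer overhead rather than by memory bandwidth.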

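Knowing this, I decided to try one more thing: copying the entire array, but with repeated calls to std::copy, each handling only part of the array. These are my results for different chunk sizes: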

unsigned int testNum = numVoxels;                                             
unsigned int parts = 1;                                                       
while (sizeof(float)*testNum > 256) {                                         
  t.start();                                                                  
  for (unsigned int i=0; i<parts; ++i) {                                      
    std::copy(data+i*testNum, 
              data+(i+1)*testNum, 
              pinnedPointer_[_index]+i*testNum);
  }                                                                           
  t.stop();                                                                   
  boost::timer::cpu_times result = t.elapsed();                               
  double time = (double)result.wall / 1.0e9;                                  
  int size = testNum*sizeof(float);                                           
  double GB = parts*(double)size / 1073741824.0;                              
  // Print results
  parts *= 2;                                                                 
  testNum /= 2;                                                               
}      

Part size 67108864 bytes, copied 0.0625 GB in 0.0331298s, 1.88652 GB/s
Part size 33554432 bytes, copied 0.0625 GB in 0.0339876s, 1.83891 GB/s
Part size 16777216 bytes, copied 0.0625 GB in 0.0342558s, 1.82451 GB/s
Part size 8388608 bytes, copied 0.0625 GB in 0.0334264s, 1.86978 GB/s
Part size 4194304 bytes, copied 0.0625 GB in 0.0287896s, 2.17092 GB/s
Part size 2097152 bytes, copied 0.0625 GB in 0.0289941s, 2.15561 GB/s
Part size 1048576 bytes, copied 0.0625 GB in 0.0240215s, 2.60184 GB/s
Part size 524288 bytes, copied 0.0625 GB in 0.0184499s, 3.38756 GB/s
Part size 262144 bytes, copied 0.0625 GB in 0.0186002s, 3.36018 GB/s
Part size 131072 bytes, copied 0.0625 GB in 0.0185958s, 3.36097 GB/s
Part size 65536 bytes, copied 0.0625 GB in 0.0185735s, 3.365 GB/s
Part size 32768 bytes, copied 0.0625 GB in 0.0186523s, 3.35079 GB/s
Part size 16384 bytes, copied 0.0625 GB in 0.0187756s, 3.32879 GB/s
Part size 8192 bytes, copied 0.0625 GB in 0.0182212s, 3.43007 GB/s
Part size 4096 bytes, copied 0.0625 GB in 0.01825s, 3.42465 GB/s
Part size 2048 bytes, copied 0.0625 GB in 0.0181881s, 3.43631 GB/s
Part size 1024 bytes, copied 0.0625 GB in 0.0180842s, 3.45605 GB/s
Part size 512 bytes, copied 0.0625 GB in 0.0186669s, 3.34817 GB/s

It looks like reducing the chunk size does have a significant impact, but I still can't get anywhere near 15 GB/s.

I'm running 64-bit Ubuntu, and the GCC optimization level doesn't make much difference.

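Why does the array size affect the bandwidth in this way?
Does the OpenCL pinned memory play a part?
What are the strategies for optimizing copies of large arrays?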

Answer 1

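I'm pretty sure you're running into cache thrashing. If you fill the cache with data you have written, then the next time some data is needed, the cache has to read it from memory, but first it has to find some space in the cache. Because all of that data (or at least a lot of it) is "dirty", since it has been written to, it must first be written back to RAM. Next, we write a new bit of data to the cache, which throws out another piece of dirty data (or something we read in earlier).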

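In assembler, we can overcome this by using a "non-temporal" move instruction, for example the SSE instruction movntps. This instruction will "avoid storing things in the cache".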

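Edit: You can also get better performance by not mixing reads and writes. Use a small buffer (a fixed-size array) of, say, 4-16 KB, copy the data into that buffer, and then write the buffer to the new location where you want it. Again, ideally use non-temporal writes, since that will improve throughput even in this case. But simply reading in "blocks" and then writing them, rather than read one, write one, will already go much faster.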

Something like this:

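   float temp[2048];                 // 8 KB staging buffer
   size_t left_to_do = numVoxels;    // floats still to copy
   size_t offset = 0;                // floats copied so far

   while (left_to_do)
   {
      // Copy at most one buffer's worth of floats per iteration.
      size_t block = std::min(left_to_do, sizeof(temp)/sizeof(temp[0]));
      std::copy(data+offset, data+offset+block, temp);
      std::copy(temp, temp+block, pinnedPointer_[_index]+offset);
      offset += block;
      left_to_do -= block;
   }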

Try it and see whether it improves things. It might not...

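Edit2: I should explain that this is faster because you reuse the same bit of cache to load data into every time, and by not mixing the reads and writes, we get better performance from the memory itself.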
