Question

Situation: In a Metal kernel function, every thread in a threadgroup reads exactly the same value at the same time. Kernel pseudocode:

kernel void foo(device   int2*   ranges,  
                constant float3& readonlyBuffer,  
                device   float*  results,  
                uint lno [[ threadgroup_position_in_grid ]])  
{  
  float acc = 0.0;  

  for(int i=ranges[lno].x; i<ranges[lno].y; i++) {  
    // each thread in threadgroup processes the same value from the buffer  
    acc += process( readonlyBuffer[i] );  
  }  

  results[...] = acc;  
}

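Problem: To optimize the buffer reads, I changed the address space qualifier of readonlyBuffer from device to constant. This had no effect on kernel performance, even though the Apple documentation suggests otherwise: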

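The constant address space is optimized for multiple instances executing a graphics or kernel function accessing the same location in the buffer.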

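Questions:

How can I improve memory read times for the constant buffer?
Can I move the buffer (or at least part of it) into an on-chip cache (something like Constant Buffer Preloading (p. 24))?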

Answer 1

In your sample code, indexing into readonlyBuffer will produce a compiler error.

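Assuming readonlyBuffer is instead declared as a pointer, the compiler does not statically know its size and cannot move the data into the constant memory space.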

If readonlyBuffer is small (you only have 4 KB of constant memory to work with), put it in a struct, like this:

struct ReadonlyBuffer {
    float3 values[MAX_BUFFER_SIZE];
};

Then do:

kernel void foo(device   int2*   ranges,  
                constant ReadonlyBuffer& readonlyBuffer,  
                device   float*  results,  
                uint lno [[ threadgroup_position_in_grid ]])
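
Putting the two pieces together, a complete kernel might look roughly like the sketch below. It is only an illustration under stated assumptions, not the asker's actual code: MAX_BUFFER_SIZE = 256 is chosen so that the struct is exactly 4 KB (a float3 occupies 16 bytes in Metal), process() is a hypothetical stand-in for the question's function, and the explicit [[buffer(n)]] indices, the extra tid parameter, and the results[tid] write are guesses at the parts the original leaves elided. Note that the array is now reached through the struct member, readonlyBuffer.values[i].

```metal
#include <metal_stdlib>
using namespace metal;

// Assumption: 256 * sizeof(float3) = 256 * 16 bytes = 4 KB.
#define MAX_BUFFER_SIZE 256

// Fixed-size wrapper so the compiler knows the buffer size statically.
struct ReadonlyBuffer {
    float3 values[MAX_BUFFER_SIZE];
};

// Hypothetical placeholder for the question's process() function.
static float process(float3 v)
{
    return v.x + v.y + v.z;
}

kernel void foo(device   int2*           ranges         [[ buffer(0) ]],
                constant ReadonlyBuffer& readonlyBuffer [[ buffer(1) ]],
                device   float*          results        [[ buffer(2) ]],
                uint lno [[ threadgroup_position_in_grid ]],
                uint tid [[ thread_position_in_grid ]])
{
    float acc = 0.0f;

    // Assumption: every range lies within [0, MAX_BUFFER_SIZE].
    for (int i = ranges[lno].x; i < ranges[lno].y; i++) {
        // Every thread in the threadgroup reads the same element.
        acc += process(readonlyBuffer.values[i]);
    }

    // Assumption: one output slot per thread; the original index is elided.
    results[tid] = acc;
}
```

With this layout the host would presumably need to bind at least sizeof(ReadonlyBuffer) bytes (4096 here) at buffer index 1, even if only part of the array is used.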

Finally, run a GPU trace ("Capture GPU Frame") and make sure you do not get the following error:

The compiler was not able to preload your buffer. Kernel Function, Buffer Index: 1.

For more details on buffer preloading, see: https://developer.apple.com/videos/play/wwdc2016/606/?time=408
