Question

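I just wrote a program that rotates an object. It simply updates a variable theta in the idle function. That variable is used to build the rotation matrices, and then I do this: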

gl_Position = rx * ry * rz * vPosition;

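rx, ry and rz (matrices) are the same for every vertex within a frame, yet the multiplication is performed for every vertex of the object. Should I use a uniform mat4 variable to store the product rx * ry * rz and pass it to the shader, or let the shader do the multiplication for every vertex? Which is faster?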

Answer 1

While profiling is essential to measure how your application responds to optimizations, in general, passing a concatenated matrix to the vertex shader is desirable. This is for two reasons:

The amount of data passed from CPU to GPU is reduced. If rx, ry and rz are all 4x4 matrices, and their product (say rx_ry_rz = rx * ry * rz) is also a 4x4 matrix, then you transfer two fewer 4x4 matrices (128 bytes, assuming 32-bit floats) as uniforms on each update. If you use this shader to render 1000 objects per frame at 60 Hz, and the uniforms update with each object, that is over 7 MB per second of saved bandwidth. Maybe not extremely significant, but every bit helps, especially if bandwidth is your bottleneck.
The amount of work the vertex stage must do is reduced (assuming a non-trivial number of vertices). Generally the vertex stage is not a bottleneck. However, many drivers balance shader core allocation between stages, so reducing work in the vertex stage could benefit other stages, such as the pixel stage. Again, profiling will give you a better idea of whether and how this benefits performance.
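The bandwidth figure in the first reason can be checked with a few lines of arithmetic. This is only a sketch of that estimate: the constant names are made up for illustration, and it assumes 32-bit floats and the 1000 objects at 60 Hz scenario described above.

```cpp
#include <cstddef>

// Illustrative constants (names are hypothetical, values come from the scenario above).
constexpr std::size_t kBytesPerMat4    = 16 * sizeof(float); // 64 bytes with 32-bit floats
constexpr std::size_t kMatricesSaved   = 2;                  // three uniform uploads become one
constexpr std::size_t kObjectsPerFrame = 1000;
constexpr std::size_t kFramesPerSecond = 60;

// Bytes of uniform traffic avoided per second by uploading one combined matrix.
constexpr std::size_t bytesSavedPerSecond() {
    return kBytesPerMat4 * kMatricesSaved * kObjectsPerFrame * kFramesPerSecond;
}

static_assert(sizeof(float) == 4, "estimate assumes 32-bit floats");
static_assert(bytesSavedPerSecond() == 7680000, "128 bytes * 1000 objects * 60 Hz");
```

That is 7.68 million bytes per second, consistent with the "over 7 MB" estimate.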

The drawback is added CPU time taken to multiply the matrices. If your application's bottleneck is CPU execution, doing this could potentially slow down your application, as it will require the CPU to do more work than it did before.
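To give a sense of how small that CPU cost is, here is a minimal sketch of the host-side concatenation. It assumes column-major 4x4 matrices stored as 16 floats and right-handed rotations about the x, y and z axes; the names `Mat4`, `combinedRotation` and the other helpers are hypothetical, not taken from the question's code. The combined matrix is built once per frame with two matrix products, and `checkEquivalent` confirms it transforms a vertex the same way as applying rz, ry and rx one after another.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Mat4 = std::array<float, 16>; // column-major: element (row, col) lives at [col * 4 + row]
using Vec4 = std::array<float, 4>;

Mat4 identity() {
    Mat4 m{};
    m[0] = m[5] = m[10] = m[15] = 1.0f;
    return m;
}

// Writes element (row, col) of a column-major matrix.
void setElement(Mat4& m, int row, int col, float value) { m[col * 4 + row] = value; }

Mat4 rotateX(float t) {
    Mat4 m = identity();
    setElement(m, 1, 1, std::cos(t)); setElement(m, 1, 2, -std::sin(t));
    setElement(m, 2, 1, std::sin(t)); setElement(m, 2, 2, std::cos(t));
    return m;
}

Mat4 rotateY(float t) {
    Mat4 m = identity();
    setElement(m, 0, 0, std::cos(t));  setElement(m, 0, 2, std::sin(t));
    setElement(m, 2, 0, -std::sin(t)); setElement(m, 2, 2, std::cos(t));
    return m;
}

Mat4 rotateZ(float t) {
    Mat4 m = identity();
    setElement(m, 0, 0, std::cos(t)); setElement(m, 0, 1, -std::sin(t));
    setElement(m, 1, 0, std::sin(t)); setElement(m, 1, 1, std::cos(t));
    return m;
}

// Standard matrix product a * b for column-major storage.
Mat4 multiply(const Mat4& a, const Mat4& b) {
    Mat4 r{};
    for (int col = 0; col < 4; ++col)
        for (int row = 0; row < 4; ++row)
            for (int k = 0; k < 4; ++k)
                r[col * 4 + row] += a[k * 4 + row] * b[col * 4 + k];
    return r;
}

// Matrix times column vector, the same operation as `m * v` in GLSL.
Vec4 transform(const Mat4& m, const Vec4& v) {
    Vec4 r{};
    for (int row = 0; row < 4; ++row)
        for (int k = 0; k < 4; ++k)
            r[row] += m[k * 4 + row] * v[k];
    return r;
}

// Computed once per frame on the CPU: rx * ry * rz.
Mat4 combinedRotation(float tx, float ty, float tz) {
    return multiply(multiply(rotateX(tx), rotateY(ty)), rotateZ(tz));
}

// The combined matrix must transform a vertex like the three separate matrices do.
void checkEquivalent(const Vec4& v, float tx, float ty, float tz) {
    const Vec4 once    = transform(combinedRotation(tx, ty, tz), v);
    const Vec4 chained = transform(rotateX(tx), transform(rotateY(ty), transform(rotateZ(tz), v)));
    for (int i = 0; i < 4; ++i)
        assert(std::fabs(once[i] - chained[i]) < 1e-5f);
}
```

In a real application the result of `combinedRotation` would typically be uploaded with `glUniformMatrix4fv(location, 1, GL_FALSE, matrix.data())`, since the storage here is already column-major. Whether the two extra matrix products per frame matter on the CPU side is something only profiling can tell you.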

Answer 2

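I wouldn't count on this repeated multiplication being optimized away unless you have convinced yourself that it really happens on all platforms you care about. To do that: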

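One option is benchmarking, but it may be difficult to isolate this operation well enough to measure a possible difference reliably.
I believe some vendors provide development tools that let you see the assembly code of the compiled shader. I think that is the only reliable way to find out what exactly happens to your GLSL code in this case.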

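This is a very typical example of a much larger theme. At least in my personal opinion, what you have is an example of code that uses OpenGL inefficiently. Repeating the same calculation for each vertex in the vertex shader, which at least conceptually is executed once per vertex, is not something you should do.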
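For illustration, a vertex shader that avoids the repeated calculation could look roughly like the source embedded below. This is a sketch under assumptions: the question does not show its shader declarations, so the `#version 150` line, the `in vec4 vPosition` declaration and the uniform name `uRotation` are all hypothetical. The only point is that the shader receives one precomputed matrix and performs a single matrix-vector product per vertex.

```cpp
#include <cassert>
#include <string>

// GLSL source held as a C++ raw string literal, ready to hand to glShaderSource.
// Assumed declarations: GLSL 1.50, attribute `vPosition`, uniform `uRotation`.
const std::string kVertexShaderSource = R"glsl(
#version 150

in vec4 vPosition;

// Holds rx * ry * rz, computed once per frame on the CPU.
uniform mat4 uRotation;

void main() {
    gl_Position = uRotation * vPosition;
}
)glsl";

// Sanity check: the source declares the uniform that the host code is expected to set.
void checkShaderDeclaresUniform() {
    assert(kVertexShaderSource.find("uniform mat4 uRotation;") != std::string::npos);
}
```

The host code then sets `uRotation` once per frame (or once per object) instead of the shader rebuilding the product for every vertex.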

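In reality, driver optimizations that work around inefficient use of the API are made based on the benefit they offer. If a high-profile app or game uses certain bad patterns (and many of them do!), and those patterns are identified as hurting performance, drivers are optimized to work around them and still deliver the best possible performance. This is particularly true if the app or game is commonly used for benchmarks. Ironically, those optimizations may hurt the performance of well-written software that is considered less important.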

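So if there has ever been an important app or game that did the same thing you are doing, which seems quite likely in this case, chances are that many drivers contain optimizations to handle it efficiently.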

Still, I wouldn't rely on it. The reasons are both philosophical and practical:

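If I work on an app, I feel that writing efficient code is my job. I don't want to write poor code and hope that somebody else happened to optimize their code to compensate for it.
You can't count on every platform the app will ever run on to contain these kinds of optimizations, particularly since app code can have a long lifetime and some of those platforms may not even exist yet.
Even if the optimizations are in place, they are most likely not free. You might trigger driver code that ends up consuming more resources than it would have taken for your code to provide the combined matrix itself.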

Should I use a uniform variable to reduce the number of matrix multiplications?
