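I just wrote a program that rotates an object. It simply uses the idle function to update a variable theta, which is then used to build the rotation matrices. After that I do this:
gl_Position = rx * ry * rz * vPosition;
The matrices rx, ry and rz are the same for every vertex within a frame, yet the multiplication is carried out for every vertex of the object. Should I store the product rx * ry * rz in a uniform mat4 and pass that to the shader, or let the shader do the multiplication for every vertex? Which is faster?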
Answer 0 (score: 1)
While profiling is essential to measure how your application responds to optimizations, in general, passing a concatenated matrix to the vertex shader is desirable. This is for two reasons:
The amount of data passed from CPU to GPU is reduced. If rx, ry and rz are all 4x4 matrices, and their product (say rx_ry_rz = rx * ry * rz) is also a 4x4 matrix, then you will be transferring two fewer 4x4 matrices (128 bytes) as uniforms on each update. If you use this shader to render 1000 objects per frame at 60 Hz, and the uniforms are updated for each object, that is more than 7 MB per second of saved bandwidth. Maybe not extremely significant, but every bit helps, especially if bandwidth is your bottleneck.
The amount of work the vertex stage must do is reduced (assuming a non-trivial number of vertices). Generally the vertex stage is not a bottleneck, however, many drivers implement load balancing in their shader core allocation between stages, so reducing work in the vertex stage could give benefits in the pixel stage (for example). Again, profiling will give you a better idea of if/how this benefits performance.
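As a rough check on the two points above, the arithmetic can be written out. This is only a sketch: the function and constant names are made up for illustration, matrices are assumed to hold 32-bit floats, and the per-vertex count is a naive one that assumes the shader compiler evaluates rx * ry * rz * vPosition left to right without hoisting anything.

```cpp
#include <cstddef>

// Size of one 4x4 matrix of 32-bit floats (64 bytes where float is 4 bytes).
constexpr std::size_t kMat4Bytes = 16 * sizeof(float);

// Uniform bytes saved per second when one combined matrix is uploaded
// instead of three separate ones, i.e. two fewer mat4 uploads per object.
constexpr std::size_t savedBytesPerSecond(std::size_t objectsPerFrame,
                                          std::size_t framesPerSecond) {
    return 2 * kMat4Bytes * objectsPerFrame * framesPerSecond;
}

// Naive scalar multiplications per vertex for rx * ry * rz * vPosition:
// two mat4 * mat4 products (64 each) plus one mat4 * vec4 product (16).
constexpr int kMulsPerVertexInShader = 2 * 64 + 16;

// With a precombined matrix only the mat4 * vec4 product remains.
constexpr int kMulsPerVertexCombined = 16;
```

With 1000 objects per frame at 60 frames per second, savedBytesPerSecond gives 7,680,000 bytes, which matches the "more than 7 MB per second" estimate. Whether the per-vertex saving is visible in practice depends on the driver and on where the bottleneck is, so it still needs profiling.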
The drawback is added CPU time taken to multiply the matrices. If your application's bottleneck is CPU execution, doing this could potentially slow down your application, as it will require the CPU to do more work than it did before.
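For a sense of what that extra CPU work looks like, here is a minimal sketch of concatenating the three rotations once per frame. Mat4, multiply, rotateX, rotateY, rotateZ and combinedRotation are hypothetical helpers, not part of any library, and the matrices are assumed to be standard axis rotations stored in column-major order; the question does not show how rx, ry and rz are actually built.

```cpp
#include <array>
#include <cmath>

// Column-major 4x4 matrix: element (row, col) lives at index col * 4 + row.
using Mat4 = std::array<float, 16>;

// Returns a * b for column-major matrices.
Mat4 multiply(const Mat4& a, const Mat4& b) {
    Mat4 r{};
    for (int col = 0; col < 4; ++col) {
        for (int row = 0; row < 4; ++row) {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k) {
                sum += a[k * 4 + row] * b[col * 4 + k];
            }
            r[col * 4 + row] = sum;
        }
    }
    return r;
}

// Rotation about the x axis by theta radians (assumed form of rx).
Mat4 rotateX(float theta) {
    const float c = std::cos(theta), s = std::sin(theta);
    return {1, 0, 0, 0,   0, c, s, 0,   0, -s, c, 0,   0, 0, 0, 1};
}

// Rotation about the y axis by theta radians (assumed form of ry).
Mat4 rotateY(float theta) {
    const float c = std::cos(theta), s = std::sin(theta);
    return {c, 0, -s, 0,   0, 1, 0, 0,   s, 0, c, 0,   0, 0, 0, 1};
}

// Rotation about the z axis by theta radians (assumed form of rz).
Mat4 rotateZ(float theta) {
    const float c = std::cos(theta), s = std::sin(theta);
    return {c, s, 0, 0,   -s, c, 0, 0,   0, 0, 1, 0,   0, 0, 0, 1};
}

// Computes rx * ry * rz once on the CPU: two 4x4 products per update,
// independent of how many vertices the object has.
Mat4 combinedRotation(float thetaX, float thetaY, float thetaZ) {
    return multiply(multiply(rotateX(thetaX), rotateY(thetaY)),
                    rotateZ(thetaZ));
}
```

The result would then be uploaded once per update, for example with glUniformMatrix4fv(location, 1, GL_FALSE, m.data()), and the shader line would reduce to something like gl_Position = rotation * vPosition; where rotation is an assumed name for the uniform mat4. The CPU cost is two small matrix products per object per frame, which is usually modest but is exactly the cost to watch if the application is CPU-bound.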
Answer 1 (score: 0)
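I would not count on this repeated multiplication being optimized away unless you have convinced yourself that it really is on every platform you care about. To do that: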
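This is a very typical example of a much larger theme. In my personal opinion, at least, what you have is an example of code that uses OpenGL inefficiently. Repeating the same calculation in the vertex shader, which is at least conceptually executed once for every vertex, is not something you should be doing.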
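In practice, driver optimizations that work around inefficient use of the API are made based on the benefit they provide. If a high-profile application or game uses certain bad patterns (and many of them do!), and those patterns are identified as hurting performance, drivers are optimized to work around them and still deliver the best possible performance. This is especially true if the application or game is commonly used for benchmarks. Ironically, these optimizations can hurt the performance of well-written software that is considered less important.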
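So if there is an important application or game that does the same thing you are doing, which seems quite likely in this case, many drivers probably contain optimizations to handle it efficiently.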
Still, I would not rely on it. The reasons are both philosophical and practical: