Question

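I am keen to maximise performance and minimise memory footprint. I am writing a GLSL vertex shader (targeting OpenGL 3.0, for which GLSL 1.3 is the corresponding GLSL version, in case anyone is unsure), and the aim is to implement skinning as economically as possible.

I have read that OpenGL guarantees at least 16KiB of uniform buffer memory (plenty for what I want to do, so no worries there). I have therefore decided to supply the bone transforms to the skinning vertex shader through uniform registers.

Some code to illustrate: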

#version 130

#define NUMBONES 16 /* 16 mat4x4 = 1KiB */
#define FLOATMAXINT 16777216

/* To map from model-space to world-space */
uniform mat4x4 u_mapWorld;

/* To map from model-space straight to screen space(?).
   = projection * view * world (using column-major btw) */
uniform mat4x4 u_mapProj; /* 'Projection' */

/* To map from model-space to light space.
   I use shadow-mapping. */
uniform mat4x4 u_mapLight;

/* For transforming normals */
uniform mat3x3 u_mapNorm;

/* Array of bone transforms for armature animation (skinning).
   This array should total to 1KiB in width? */
uniform mat4x4 u_mapBones [NUMBONES];

/* Array of (corresponding) bone transforms for mapping vertex normals.
   Probably 1KiB in width, in order to honour alignment requirements.
   Desirable width would be 0.56KiB, though. */
uniform mat3x3 u_mapBoneNorms [NUMBONES];

/* Width of two arrays = 2KiB, optimised for speed.
   Fine for what I want to do. */

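I delegate the matrix calculations to the CPU to spare the GPU the matrix inversion (needed to compute the vertex normal transforms), especially since doing that step on the GPU is not strictly necessary and is best avoided. I don't mind this consuming more memory if it means faster vertex processing.

Next, I have the vertex inputs, as follows: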
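For reference, the CPU-side step could look roughly like the sketch below: it computes the inverse-transpose of the upper-left 3x3 of a bone matrix, which is the usual choice for transforming normals. The column-major `std::array` layout and the helper name `normalMatrix` are assumptions made up for illustration.

```cpp
#include <array>

using Mat3 = std::array<float, 9>;   // column-major 3x3
using Mat4 = std::array<float, 16>;  // column-major 4x4

// Hypothetical helper: writes the inverse-transpose of the upper-left 3x3
// of m into out. Returns false if that 3x3 block is exactly singular.
bool normalMatrix(const Mat4& m, Mat3& out) {
    // Element (row r, column c) of the 4x4 lives at m[c * 4 + r].
    const float a = m[0], b = m[4], c = m[8];
    const float d = m[1], e = m[5], f = m[9];
    const float g = m[2], h = m[6], i = m[10];

    // Cofactors of the 3x3 block [[a,b,c],[d,e,f],[g,h,i]].
    const float A =  (e * i - f * h);
    const float B = -(d * i - f * g);
    const float C =  (d * h - e * g);
    const float D = -(b * i - c * h);
    const float E =  (a * i - c * g);
    const float F = -(a * h - b * g);
    const float G =  (b * f - c * e);
    const float H = -(a * f - c * d);
    const float I =  (a * e - b * d);

    const float det = a * A + b * B + c * C;
    if (det == 0.0f) return false;
    const float s = 1.0f / det;

    // inverse = transpose(cofactors) / det, so inverse-transpose is simply
    // cofactors / det. Stored column-major: out[c * 3 + r].
    out = { A * s, D * s, G * s,
            B * s, E * s, H * s,
            C * s, F * s, I * s };
    return true;
}
```

One such 3x3 result per bone would then be uploaded to `u_mapBoneNorms`, so the shader never has to invert anything.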

in vec3 v_pos;
in vec3 v_norm;
in vec2 v_uv;
in float v_weights; /* Actually four weights */
in float v_indices; /* Actually four indices */

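I plan to support GL versions as low as 2.0, and possibly extend that down to 1.4, which is why v_weights and v_indices are floats rather than packed integers. Versions below 3.0 are specified as lacking support for integer inputs (and the glVertexAttribIPointer procedure). I have read (and was impressed to learn) that a single-precision (32-bit) IEEE float can reliably represent every integer up to 16,777,216 without loss of precision. That gives 24 bits of reliable index capacity, which I chose to split between four weights (24/4 = 6), giving each weight index a width of 6 bits (index range 0 to 63).

My program passes the weight indices to OpenGL by constructing a packed integer: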
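The 2^24 limit is easy to check on the CPU side. A minimal sketch, assuming `float` is a 32-bit IEEE type on the target platform; the helper name `survivesFloatRoundTrip` is made up for illustration:

```cpp
#include <cstdint>

// Returns true if n is unchanged after being converted to a 32-bit float
// and back to an unsigned integer.
bool survivesFloatRoundTrip(std::uint32_t n) {
    const float f = static_cast<float>(n);
    return static_cast<std::uint32_t>(f) == n;
}
```

The largest value four 6-bit fields can produce is 2^24 - 1 = 16,777,215, which passes this check; 16,777,217 is the first integer that does not.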

std::uint32_t ui_indices =
   ((w0 & 0x3F) <<  0) + ((w1 & 0x3F) <<  6) +
   ((w2 & 0x3F) << 12) + ((w3 & 0x3F) << 18);

It is then converted to a float (cast, you know what I mean), on the assumption that no precision is lost:

float f_indices = static_cast<float>(ui_indices);

It is then passed to OpenGL via glVertexAttribPointer().

Then, in my vertex shader, the following code unpacks the indices:
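The pack-and-convert step can be verified end to end before any shader is involved. A minimal sketch, assuming a 32-bit IEEE `float`; `packIndices` and `unpackIndex` are hypothetical names, and the unpack simply mirrors the shift-and-mask logic on the CPU:

```cpp
#include <cstdint>

// Packs four 6-bit bone indices into the low 24 bits, then converts the
// result to float for upload as a vertex attribute.
float packIndices(std::uint32_t w0, std::uint32_t w1,
                  std::uint32_t w2, std::uint32_t w3) {
    const std::uint32_t packed =
        ((w0 & 0x3Fu) <<  0) | ((w1 & 0x3Fu) <<  6) |
        ((w2 & 0x3Fu) << 12) | ((w3 & 0x3Fu) << 18);
    return static_cast<float>(packed);
}

// Recovers index k (0..3) from the packed float.
std::uint32_t unpackIndex(float packed, unsigned k) {
    const std::uint32_t ii = static_cast<std::uint32_t>(packed);
    return (ii >> (6u * k)) & 0x3Fu;
}
```

Because the packed value never exceeds 2^24 - 1, every combination of four indices in 0..63 should come back unchanged.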


/* NONE OF THIS CODE IS YET TESTED */

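uint ii;   /* Int-converted index */
uint i[4]; /* Unpacked indices (can we use uint8?) */

/* Extract indices. v_indices holds a whole number below 2^24,
   so the float-to-uint conversion is exact. */
ii = uint(v_indices);

/* Unpack indices.
   I've kept shift-by-zero for presentation.
   The u suffix keeps both operands of & unsigned, as GLSL 1.30 requires. */
i[0] = (ii >>  0) & 0x3Fu;
i[1] = (ii >>  6) & 0x3Fu;
i[2] = (ii >> 12) & 0x3Fu;
i[3] = (ii >> 18) & 0x3Fu;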
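GLSL 1.10 and 1.20 reserve the bitwise operators without supporting them, so a GL 2.0 path would presumably have to unpack with floating-point arithmetic instead. The C++ sketch below mirrors the floor-and-modulo arithmetic such a shader might use; `unpackIndexFloat` and the divisor table are assumptions for illustration, not tested shader code.

```cpp
#include <cmath>

// Extracts 6-bit field k (0..3) from a packed value using only float math:
// floor(packed / 64^k) mod 64. Every step is exact for whole numbers
// below 2^24, because the divisors are powers of two.
float unpackIndexFloat(float packed, int k) {
    const float divisors[4] = { 1.0f, 64.0f, 4096.0f, 262144.0f };
    const float shifted = std::floor(packed / divisors[k]);
    return std::fmod(shifted, 64.0f);
}
```

In a shader the same expression would be written with the built-in `floor` and `mod` functions.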

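Finally, my question:

Assuming the concept and implementation are not flawed (nothing is tested yet), will GLSL optimise these statements well? Also, would it be better to implement the masking as a series of shifts (rather than bit-twiddling with masks), for example: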

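/* uint is 32 bits wide: shift left to discard the higher fields,
   then shift right by 26 to leave the 6-bit field in the low bits. */
i[0] = (ii << 26) >> 26;
i[1] = (ii << 20) >> 26;
i[2] = (ii << 14) >> 26;
i[3] = (ii <<  8) >> 26;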
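Whichever form turns out faster, the two should at least agree on the result. For a 32-bit unsigned word, the shift-only form needs a left shift of 26 - 6k followed by a right shift of 26 to isolate field k. A minimal C++ sketch for comparing them on the CPU, with the hypothetical helpers `fieldByMask` and `fieldByShifts`:

```cpp
#include <cstdint>

// Field k (0..3) via shift-right then mask.
std::uint32_t fieldByMask(std::uint32_t ii, unsigned k) {
    return (ii >> (6u * k)) & 0x3Fu;
}

// Field k (0..3) via shift-left (discarding all higher bits) then
// shift-right, assuming a 32-bit unsigned word.
std::uint32_t fieldByShifts(std::uint32_t ii, unsigned k) {
    return (ii << (26u - 6u * k)) >> 26u;
}
```

Both variants cost one or two integer operations per index, so any difference would come down to how a particular driver compiles them.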

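P.S. Apologies if this has been answered elsewhere and I missed it. Apologies also if I have given too much background for a small question.

Optimising bitwise operations on the GPU with GLSL 1.3

0 answers: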