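My OpenGL ES program currently has a performance bottleneck. I expected it to run well: it uses VBOs, a texture atlas, only a few binds per draw call, and so on. But performance drops a lot when many sprites are on screen at once. I found that the bottleneck is CPU-bound (somewhat to my surprise). More precisely, the bottleneck is a method that computes the screen position of each rectangle's four vertices: x1, y1, x2, y2, x3, y3, x4, y4. These positions are used for collision detection. What the method does is repeat work that is already done in the shader, and I believe many CPU cycles go into the MV multiplication: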
Matrix.multiplyMV(resultVec, 0, mModelMatrix, 0, rhsVec, 0);
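rhsVec is a float array that stores one vertex, as described above.
Since this seems to be the bottleneck, I would like to know how I can access the same vector in the shader, for example at the point where the clip coordinates are computed. Clip coordinates would do, or even better the coordinates from further down the rendering pipeline.
Vertex shader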
uniform mat4 u_MVPMatrix;
uniform mat4 u_MVMatrix;
varying vec2 v_TexCoordinate;
attribute vec4 a_Position;
attribute vec2 a_TexCoordinate;
void main()
{
v_TexCoordinate = a_TexCoordinate;
gl_Position = u_MVPMatrix * a_Position;
}
Snippet from onSurfaceCreated
final int vertexShaderHandle = ShaderHelper.compileShader(GLES20.GL_VERTEX_SHADER, vertexShader);
final int fragmentShaderHandle = ShaderHelper.compileShader(GLES20.GL_FRAGMENT_SHADER, fragmentShader);
mProgramHandle = ShaderHelper.createAndLinkProgram(vertexShaderHandle, fragmentShaderHandle,
new String[] {"a_Position", "a_Color", "a_Normal", "a_TexCoordinate"});
textureHandle = TextureHelper.loadTexture(context);
GLES20.glUseProgram(mProgramHandle);
mMVPMatrixHandle = GLES20.glGetUniformLocation(mProgramHandle, "u_MVPMatrix");
mMVMatrixHandle = GLES20.glGetUniformLocation(mProgramHandle, "u_MVMatrix");
//mColorHandle = GLES20.glGetAttribLocation(mProgramHandle, "a_Color");
mTextureCoordinateHandle = GLES20.glGetAttribLocation(mProgramHandle, "a_TexCoordinate");
mPositionHandle = GLES20.glGetAttribLocation(mProgramHandle, "a_Position");
The method that does the vertex transformation (the bottleneck)
private void calcPos(int index) {
int k = 0;
for (int i = 0; i < 18; i += 3) {
rhsVec[0] = vertices[0 + i];
rhsVec[1] = vertices[1 + i];
rhsVec[2] = vertices[2 + i];
rhsVec[3] = 1;
// *** Step 1 : Getting to eye coordinates ***
Matrix.multiplyMV(resultVec, 0, mModelMatrix, 0, rhsVec, 0);
// *** Step 2 : Getting to clip coordinates ***
float[] rhsVec2 = resultVec;
Matrix.multiplyMV(resultVec2, 0, mProjectionMatrix, 0, rhsVec2, 0);
// *** Step 3 : Getting to normalized device coordinates ***
float inv_w = 1 / resultVec2[3];
for (int j = 0; j < resultVec2.length - 1; j++) {
resultVec2[j] = inv_w * resultVec2[j];
}
float xPos = (resultVec2[0] * 0.5f + 0.5f) * game_width;
float yPos = (resultVec2[1] * 0.5f + 0.5f) * game_height;
float zPos = (1 + resultVec2[2]) * 0.5f;
SpriteData sD = spriteDataArrayList.get(index);
switch (k) {
case 0:
sD.xPos[0] = xPos;
sD.yPos[0] = yPos;
break;
case 1:
sD.xPos[2] = xPos;
sD.yPos[2] = yPos;
break;
case 2:
sD.xPos[3] = xPos;
sD.yPos[3] = yPos;
break;
case 3:
sD.xPos[1] = xPos;
sD.yPos[1] = yPos;
break;
}
k++;
if (i == 3) {
i += 9;
}
}
}
This method is called once per sprite, so for 100 sprites it runs 100 times. Could the MV multiplication be what hurts performance?
Answer 0 (score: 1)
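To answer the main question: I don't think it is possible to get the transformed vertices back from the GPU.
Some first steps for optimizing the loop. First, don't repeat work inside the loop when it always produces the same result; do it once outside the loop. This applies especially to function or property calls.
Next, you can multiply the 2 matrices together so that a single matrix multiplication applies both of their transformations in order. Although you don't seem to be transforming the final result back into screen space.
You are copying data and then using the copy without changing it. I know the matrix multiplication probably needs 4 floats or a Vec4, but you could write a matrix multiplication that avoids the copy and fills in the w parameter itself.
Avoid calculations whose results you never use.
Cache the results, and don't recompute them unless something changes.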
private void calcPos(int index) {
// get only once, not every loop
SpriteData sD = spriteDataArrayList.get(index);
int[] vIndices = {0, 1, 2, 5}; // the 4 verts you want
// multiply once outside the loop, use result inside loop
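// (allocating here is for brevity; a preallocated field avoids garbage per call)
float[] mvpMatrix = new float[16];
// multiplyMM computes lhs * rhs, so this is projection * model: model is applied first
Matrix.multiplyMM(mvpMatrix, 0, mProjectionMatrix, 0, mModelMatrix, 0);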
for (int i = 0; i < 4; ++i) { // only grab verts you want, no need for fancy skips
int nVert = 3 * vIndices[i]; // 3 floats per vert
// should avoid copying data when you aren't going to change the copy
rhsVec[0] = vertices[0 + nVert];
rhsVec[1] = vertices[1 + nVert];
rhsVec[2] = vertices[2 + nVert];
rhsVec[3] = 1; // a custom multiplyMV3 (you would have to write it yourself) could read
// 3 floats straight from vertices and fill in the w param, then no need to copy
// E.g. :
// multiplyMV3(resultVec2, 0, mvpMatrix, 0, vertices, nVert);
// do both matrix multiplications at the same time
Matrix.multiplyMV(resultVec2, 0, mvpMatrix, 0, rhsVec, 0);
// *** Step 3 : Getting to normalized device coordinates ***
float inv_w = 1 / resultVec2[3];
for (int j = 0; j < 2; ++j) // just what we need
resultVec2[j] *= inv_w;
// Curious... Transform into projection space, just to transform
// back into screen space. Perhaps you are transforming too far?
float xPos = (resultVec2[0] * 0.5f + 0.5f) * game_width;
float yPos = (resultVec2[1] * 0.5f + 0.5f) * game_height;
// float zPos = (1 + resultVec2[2]) * 0.5f; // not used
switch (i) {
case 0:
sD.xPos[0] = xPos;
sD.yPos[0] = yPos;
break;
case 1:
sD.xPos[2] = xPos;
sD.yPos[2] = yPos;
break;
case 2:
sD.xPos[3] = xPos;
sD.yPos[3] = yPos;
break;
case 3:
sD.xPos[1] = xPos;
sD.yPos[1] = yPos;
break;
}
}
}
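The multiplyMV3 idea from the comments above could look roughly like the sketch below. It is only an illustration under some assumptions: SpriteProjection and projectToScreen are made-up names (nothing like them exists in android.opengl.Matrix), the matrix is assumed to be the combined projection * model matrix in the column-major layout that android.opengl.Matrix uses, and the viewport is assumed to start at (0, 0). The helper reads the three floats directly from the vertex array, treats w as 1, skips the depth row because collision detection only needs x and y, and writes screen coordinates into a caller-supplied array so nothing is allocated per vertex.

```java
// Hypothetical helper for illustration; not part of android.opengl.Matrix.
final class SpriteProjection {
    private SpriteProjection() {}

    /**
     * Transforms the vertex stored at vertices[offset .. offset + 2] by a
     * column-major 4x4 matrix, treating w as 1, and writes the resulting
     * screen-space x and y into out[0] and out[1].
     */
    static void projectToScreen(float[] mvp, float[] vertices, int offset,
                                float width, float height, float[] out) {
        float x = vertices[offset];
        float y = vertices[offset + 1];
        float z = vertices[offset + 2];

        // Rows 0, 1 and 3 of mvp * (x, y, z, 1). Row 2 (depth) is skipped
        // because the result is never used.
        float clipX = mvp[0] * x + mvp[4] * y + mvp[8] * z + mvp[12];
        float clipY = mvp[1] * x + mvp[5] * y + mvp[9] * z + mvp[13];
        float clipW = mvp[3] * x + mvp[7] * y + mvp[11] * z + mvp[15];

        // Perspective divide, then map NDC [-1, 1] to [0, width] x [0, height].
        float invW = 1.0f / clipW;
        out[0] = (clipX * invW * 0.5f + 0.5f) * width;
        out[1] = (clipY * invW * 0.5f + 0.5f) * height;
    }
}
```

Inside the loop this would replace the copy into rhsVec, the multiplyMV call and the divide, with nVert passed as the offset. Whether it is actually faster than Matrix.multiplyMV on a given device is something to measure rather than assume.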