Question

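I am trying to move my project from OpenGL to Metal on iOS, but I seem to have hit a wall. The task is simple...

I have a large texture (over 3000x3000 pixels). On every touchesMoved event I need to draw several hundred small textures (e.g. 124x124) onto it, with a specific blending function enabled. It is basically a paint brush. The large texture is then displayed. That is roughly the task.

On OpenGL it runs very fast; I get about 60fps. When I port the same code to Metal, I can only manage 15fps.

I have created two bare-minimum sample projects that demonstrate the problem. Here are the projects (both OpenGL and Metal)...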

https://drive.google.com/file/d/12MPt1nMzE2UL_s4oXEUoTCXYiTz42r4b/view?usp=sharing

This is roughly what I do in OpenGL...

    - (void) renderBrush:(GLuint)brush on:(GLuint)fbo ofSize:(CGSize)size at:(CGPoint)point {
        // Texture coordinates covering the whole brush texture
        GLfloat brushCoordinates[] = {
            0.0f, 0.0f,
            1.0f, 0.0f,
            0.0f,  1.0f,
            1.0f,  1.0f,
        };

        // Full-target quad; narrowed to the brush rectangle below
        GLfloat imageVertices[] = {
            -1.0f, -1.0f,
            1.0f, -1.0f,
            -1.0f,  1.0f,
            1.0f,  1.0f,
        };

        int brushSize = 124;

        CGRect rect = CGRectMake(point.x - brushSize/2, point.y - brushSize/2, brushSize, brushSize);

        // Normalize the rectangle to the 0..1 range of the target
        rect.origin.x /= size.width;
        rect.origin.y /= size.height;
        rect.size.width /= size.width;
        rect.size.height /= size.height;

        [self convertImageVertices:imageVertices toProjectionRect:rect onImageOfSize:size];

        // Remember the currently bound framebuffer so it can be restored afterwards
        GLint currentFBO;
        glGetIntegerv(GL_FRAMEBUFFER_BINDING, &currentFBO);

        [_Program use];

        glBindFramebuffer(GL_FRAMEBUFFER, fbo);
        glViewport(0, 0, (int)size.width, (int)size.height);

        glActiveTexture(GL_TEXTURE2);
        glBindTexture(GL_TEXTURE_2D, brush);
        glUniform1i(brushTextureLocation, 2);

        glVertexAttribPointer(positionLocation, 2, GL_FLOAT, GL_FALSE, 0, imageVertices);
        glVertexAttribPointer(brushCoordinateLocation, 2, GL_FLOAT, GL_FALSE, 0, brushCoordinates);

        glEnable(GL_BLEND);
        glBlendEquation(GL_FUNC_ADD);
        glBlendFuncSeparate(GL_ONE, GL_ZERO, GL_ONE, GL_ONE);

        glDrawArrays(GL_TRIANGLE_STRIP, 0, 4);

        glDisable(GL_BLEND);

        glActiveTexture(GL_TEXTURE2);
        glBindTexture(GL_TEXTURE_2D, 0);

        glBindFramebuffer(GL_FRAMEBUFFER, (GLuint)currentFBO);
    }

I run this code in a loop (around 200-500 iterations) on every touch event. It runs very fast.

This is how I ported the code to Metal. On every touch event I run it in a loop using a single MTLCommandBuffer...

    - (void) renderBrush:(id<MTLTexture>)brush onTarget:(id<MTLTexture>)target at:(CGPoint)point withCommandBuffer:(id<MTLCommandBuffer>)commandBuffer {

        int brushSize = 124;

        CGRect rect = CGRectMake(point.x - brushSize/2, point.y - brushSize/2, brushSize, brushSize);

        rect.origin.x /= target.width;
        rect.origin.y /= target.height;
        rect.size.width /= target.width;
        rect.size.height /= target.height;

        Float32 imageVertices[8];
        // Calculate the vertices (basically the rectangle we need to draw) on the target texture.
        // We are not drawing on the entire target texture, only on a square around the point.
        [self composeImageVertices:imageVertices toProjectionRect:rect onImageOfSize:CGSizeMake(target.width, target.height)];

        // We use a different vertex buffer per pass because this runs in a loop and
        // subsequent calls would otherwise overwrite the values. Other buffers also get
        // overwritten, but that is OK for now; we only need to demonstrate the performance.
        id<MTLBuffer> vertexBuffer = [_vertexArray lastObject];

        memcpy([vertexBuffer contents], imageVertices, 8 * sizeof(Float32));

        // A new render command encoder is created for every single quad
        id<MTLRenderCommandEncoder> commandEncoder = [commandBuffer renderCommandEncoderWithDescriptor:mRenderPassDescriptor];
        commandEncoder.label = @"DrawCE";

        [commandEncoder setRenderPipelineState:mPipelineState];

        [commandEncoder setVertexBuffer:vertexBuffer offset:0 atIndex:0];
        [commandEncoder setVertexBuffer:mBrushTextureBuffer offset:0 atIndex:1];

        [commandEncoder setFragmentTexture:brush atIndex:0];
        [commandEncoder setFragmentSamplerState:mSampleState atIndex:0];

        [commandEncoder drawPrimitives:MTLPrimitiveTypeTriangleStrip vertexStart:0 vertexCount:4];
        [commandEncoder endEncoding];
    }

In the sample code I attached, I replaced the touch events with a timer loop to keep things simple.

On an iPhone 7 Plus I get 60fps with OpenGL and 15fps with Metal. Am I doing something wrong here?

Answer 1

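Remove all redundancy:

Don't create buffers at render time. Allocate a sufficient number of buffers during initialization.
Don't create a command encoder for each quad.
Use one big vertex buffer with a different (properly aligned) offset for each quad. Use -setVertexBufferOffset:atIndex: to set just the offset, when necessary, without changing the buffer.
composeImageVertices:... can write directly into the vertex buffer with an appropriate cast, avoiding the memcpy.
Depending on what composeImageVertices:... actually does, and if deltaX and deltaY are constants, you may be able to set up the vertex buffer just once. The vertex shader can transform the vertices as necessary. You would pass in the appropriate data as uniforms (either the destination point and render target size, or even a transform matrix).
Assuming they are the same each time, don't set mPipelineState, mBrushTextureBuffer, and mSampleState each time.
If any of the quads share the same brush texture, group them together and issue a single draw command to draw them all. This may require switching to the triangle primitive rather than the triangle strip primitive. However, if you do indexed drawing, you can use the primitive restart sentinel to draw multiple triangle strips in a single draw command.
You can even draw multiple brushes in a single draw command, if the count doesn't exceed the allowed number of textures (31). Pass all of the brush textures to the fragment shader, which can receive them as a texture array. The vertex data would include the brush index, the vertex shader would pass it forward, and the fragment shader would use it to look up the texture to sample from in the array.
You can use instanced drawing to draw everything in a single command. Draw stroke instances of a single quad. In the vertex shader, transform the position based on the instance ID. You would have to pass deltaX and deltaY in as uniform data. The brush indices can also live in a single buffer that is passed in, and the shader can look up the brush index in it by instance ID.
Have you considered using point primitives instead of quads? That would reduce the number of vertices and give Metal information it can use to optimize rasterization.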
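
As a rough illustration of the batching advice above (one large vertex buffer, and one draw command for all quads that share a brush), the following C sketch expands each stroke point into two triangles. It compiles unchanged inside an Objective-C file. The `BrushPoint` and `BrushVertex` types, the `fill_brush_quads` helper, and the pixel-to-clip-space mapping are assumptions made for illustration; the real composeImageVertices:... may compute its rectangle differently.

```c
#include <stddef.h>

/* Hypothetical stroke point, in pixels of the target texture. */
typedef struct {
    float x;
    float y;
} BrushPoint;

/* Hypothetical vertex layout: clip-space position, then brush texture coordinate. */
typedef struct {
    float position[2];
    float texCoord[2];
} BrushVertex;

/* Two triangles per quad, so consecutive quads need no strip restart. */
enum { kVerticesPerQuad = 6 };

/* Writes one quad per stroke point into `out`, which must hold at least
   count * kVerticesPerQuad vertices. Returns the number of vertices written.
   Assumed mapping: pixel 0 maps to -1 and pixel `targetWidth`/`targetHeight`
   maps to +1, so texture coordinate (0,0) sits at the quad's lowest x and y. */
static size_t fill_brush_quads(BrushVertex *out,
                               const BrushPoint *points, size_t count,
                               float brushSize,
                               float targetWidth, float targetHeight)
{
    /* Unit-square corners for triangles (0,0)(1,0)(0,1) and (1,0)(1,1)(0,1). */
    static const float corners[kVerticesPerQuad][2] = {
        {0.0f, 0.0f}, {1.0f, 0.0f}, {0.0f, 1.0f},
        {1.0f, 0.0f}, {1.0f, 1.0f}, {0.0f, 1.0f},
    };
    size_t n = 0;
    for (size_t i = 0; i < count; ++i) {
        /* Corner of the brush square centered on the stroke point. */
        float left = points[i].x - brushSize * 0.5f;
        float top  = points[i].y - brushSize * 0.5f;
        for (int c = 0; c < kVerticesPerQuad; ++c) {
            float px = left + corners[c][0] * brushSize;
            float py = top  + corners[c][1] * brushSize;
            out[n].position[0] = px / targetWidth  * 2.0f - 1.0f;
            out[n].position[1] = py / targetHeight * 2.0f - 1.0f;
            out[n].texCoord[0] = corners[c][0];
            out[n].texCoord[1] = corners[c][1];
            ++n;
        }
    }
    return n;
}
```

Under these assumptions, `out` could be `[vertexBuffer contents]` cast to `BrushVertex *`, which also avoids the memcpy, and the whole batch could then be drawn inside a single render command encoder with one `drawPrimitives:MTLPrimitiveTypeTriangle vertexStart:0 vertexCount:n` call. The vertex descriptor and shaders would have to match the interleaved layout, which differs from the two separate buffers used in the question.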

Metal rendering is much slower than OpenGL when rendering small textures onto a large texture
