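I'm trying to create a procedural game using Metal, and I'm using an octree-based chunk approach to implement level of detail.

In my approach, the CPU creates the octree nodes for the terrain, and a compute shader then creates each node's mesh on the GPU. That mesh is stored in a vertex buffer and an index buffer in the chunk object for rendering.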
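All of this seems to work fairly well, but I'm hitting performance problems very early on when rendering chunks. Currently I gather an array of chunks to draw and submit it to my renderer, which creates an `MTLParallelRenderCommandEncoder`, then creates an `MTLRenderCommandEncoder` for each chunk, and then submits everything to the GPU.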
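By the look of it, around 50% of the CPU time is spent creating the `MTLRenderCommandEncoder` for each chunk. Right now I'm only creating a simple 8-vertex cube mesh per chunk, I have a 4x4x4 array of chunks, and I'm already dropping to around 50 fps at this early stage. (In fact, it seems that each `MTLParallelRenderCommandEncoder` can hold at most 63 `MTLRenderCommandEncoder`s, so it isn't quite 4x4x4.)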
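I've read that the point of `MTLParallelRenderCommandEncoder` is to create each `MTLRenderCommandEncoder` on a separate thread, but I haven't had much luck getting that to work. Also, multithreading wouldn't get around the cap of at most 63 rendered chunks.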
I feel that somehow merging each chunk's vertex and index buffers into one or two larger buffers for submission would help, but I don't know how to do that without a large number of `memcpy()` calls, or whether it would even improve efficiency.

Here is my code, which takes an array of nodes and draws them:

```swift
func drawNodes(nodes: [OctreeNode], inView view: AHMetalView){
    // For control of several rotating buffers
    dispatch_semaphore_wait(displaySemaphore, DISPATCH_TIME_FOREVER)

    makeDepthTexture()
    updateUniformsForView(view, duration: view.frameDuration)

    let commandBuffer = commandQueue.commandBuffer()
    let optDrawable = layer.nextDrawable()
    guard let drawable = optDrawable else{
        return
    }

    let passDescriptor = MTLRenderPassDescriptor()
    passDescriptor.colorAttachments[0].texture = drawable.texture
    passDescriptor.colorAttachments[0].clearColor = MTLClearColorMake(0.2, 0.2, 0.2, 1)
    passDescriptor.colorAttachments[0].storeAction = .Store
    passDescriptor.colorAttachments[0].loadAction = .Clear
    passDescriptor.depthAttachment.texture = depthTexture
    passDescriptor.depthAttachment.clearDepth = 1
    passDescriptor.depthAttachment.loadAction = .Clear
    passDescriptor.depthAttachment.storeAction = .Store

    let parallelRenderPass = commandBuffer.parallelRenderCommandEncoderWithDescriptor(passDescriptor)

    // Currently 63 nodes as a maximum
    for node in nodes{
        // This line is taking up around 50% of the CPU time
        let renderPass = parallelRenderPass.renderCommandEncoder()

        renderPass.setRenderPipelineState(renderPipelineState)
        renderPass.setDepthStencilState(depthStencilState)
        renderPass.setFrontFacingWinding(.CounterClockwise)
        renderPass.setCullMode(.Back)

        let uniformBufferOffset = sizeof(AHUniforms) * uniformBufferIndex
        renderPass.setVertexBuffer(node.vertexBuffer, offset: 0, atIndex: 0)
        renderPass.setVertexBuffer(uniformBuffer, offset: uniformBufferOffset, atIndex: 1)
        renderPass.setTriangleFillMode(.Lines)

        renderPass.drawIndexedPrimitives(.Triangle, indexCount: AHMaxIndicesPerChunk, indexType: AHIndexType, indexBuffer: node.indexBuffer, indexBufferOffset: 0)
        renderPass.endEncoding()
    }
    parallelRenderPass.endEncoding()

    commandBuffer.presentDrawable(drawable)
    commandBuffer.addCompletedHandler { (commandBuffer) -> Void in
        self.uniformBufferIndex = (self.uniformBufferIndex + 1) % AHInFlightBufferCount
        dispatch_semaphore_signal(self.displaySemaphore)
    }
    commandBuffer.commit()
}
```
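The buffer-consolidation idea does not necessarily need CPU-side `memcpy()`: one option is to allocate a single large vertex buffer and a single large index buffer up front and hand each chunk a byte range inside them, so the compute shader writes each mesh straight into its range and each draw only changes offsets. The sketch below shows just the offset bookkeeping in plain modern Swift (not the Swift 2 syntax of the code above). `ChunkArena` and `ChunkSlice` are hypothetical names, and the default 256-byte alignment is a conservative assumption; the real minimum depends on the platform and on how the shader reads the buffer.

```swift
// Hypothetical sub-allocator: every chunk gets a byte range inside one shared
// vertex buffer and one shared index buffer, so draws differ only by offsets.
struct ChunkSlice {
    let vertexOffset: Int  // byte offset into the shared vertex buffer
    let indexOffset: Int   // byte offset into the shared index buffer
}

struct ChunkArena {
    let vertexCapacity: Int  // size in bytes of the shared vertex buffer
    let indexCapacity: Int   // size in bytes of the shared index buffer
    let alignment: Int       // assumed offset alignment (conservative default)
    private var nextVertex = 0
    private var nextIndex = 0

    init(vertexCapacity: Int, indexCapacity: Int, alignment: Int = 256) {
        self.vertexCapacity = vertexCapacity
        self.indexCapacity = indexCapacity
        self.alignment = alignment
    }

    /// Rounds `value` up to the next multiple of `alignment`.
    private func align(_ value: Int) -> Int {
        (value + alignment - 1) / alignment * alignment
    }

    /// Reserves space for one chunk's mesh (a simple bump allocation).
    /// Returns nil when either shared buffer has no room left.
    mutating func allocate(vertexBytes: Int, indexBytes: Int) -> ChunkSlice? {
        let v = align(nextVertex)
        let i = align(nextIndex)
        guard v + vertexBytes <= vertexCapacity, i + indexBytes <= indexCapacity else {
            return nil
        }
        nextVertex = v + vertexBytes
        nextIndex = i + indexBytes
        return ChunkSlice(vertexOffset: v, indexOffset: i)
    }
}
```

In the render loop you would then bind the one shared vertex buffer with `offset: slice.vertexOffset` (or change only the offset with `setVertexBufferOffset`) and pass `slice.indexOffset` as `indexBufferOffset`; on the compute side, binding the shared buffer at the same offset lets the shader write the mesh in place. A bump allocator never frees anything, so as LOD nodes come and go you would need a free list or periodic rebuild; that part is left out here.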
Answer (score: 7)
You noted:

> I've read that the point of `MTLParallelRenderCommandEncoder` is to create each `MTLRenderCommandEncoder` on a separate thread...
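You're right. What you're doing is creating, encoding with, and ending command encoders sequentially; nothing here runs in parallel, so `MTLParallelRenderCommandEncoder` does nothing for you. If you dropped the parallel encoder and simply created an encoder with `renderCommandEncoderWithDescriptor(_:)` on each pass through the `for` loop, you'd get roughly the same performance... which is to say, you'd still have the same performance problem: the overhead of creating all those encoders.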
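For completeness, if you did want `MTLParallelRenderCommandEncoder` to pay off, the usual shape is a small number of sub-encoders (one per worker thread, not one per chunk), created up front on one thread and then filled concurrently, each with a contiguous batch of nodes. The sketch below is modern Swift rather than the Swift 2 syntax above; `contiguousBatches` and `encodeInParallel` are hypothetical helper names, and the Metal part sits behind `#if canImport(Metal)` so the batching logic compiles on any platform. Each sub-encoder starts with its own default state, so the `encode` closure has to set the pipeline, depth-stencil state, and so on for its batch.

```swift
/// Splits `count` items into at most `batches` contiguous, nearly equal ranges.
func contiguousBatches(count: Int, batches: Int) -> [Range<Int>] {
    precondition(batches > 0, "need at least one batch")
    let n = min(batches, count)
    guard n > 0 else { return [] }
    let base = count / n, extra = count % n
    var ranges: [Range<Int>] = []
    var start = 0
    for i in 0..<n {
        // The first `extra` batches take one additional item each.
        let size = base + (i < extra ? 1 : 0)
        ranges.append(start..<start + size)
        start += size
    }
    return ranges
}

#if canImport(Metal)
import Metal
import Dispatch

/// Hypothetical helper: one sub-encoder per batch, encoded concurrently.
func encodeInParallel(_ parallel: MTLParallelRenderCommandEncoder,
                      nodeCount: Int,
                      threads: Int,
                      encode: (MTLRenderCommandEncoder, Range<Int>) -> Void) {
    let ranges = contiguousBatches(count: nodeCount, batches: threads)
    // Create all sub-encoders on this thread first: their creation order,
    // not the order in which encoding finishes, is the GPU execution order.
    let encoders = ranges.map { _ in parallel.makeRenderCommandEncoder()! }
    DispatchQueue.concurrentPerform(iterations: ranges.count) { i in
        // Each sub-encoder is touched by exactly one thread.
        encode(encoders[i], ranges[i])
        encoders[i].endEncoding()
    }
    // Only end the parent after every sub-encoder has ended.
    parallel.endEncoding()
}
#endif
```

Even done this way, the per-encoder overhead the answer describes still applies to each sub-encoder, which is why keeping the count small (a handful of batches) matters more than the threading itself.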
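So, if you're going to encode sequentially, just reuse the same encoder. You should also reuse as much of your other shared state as possible. Here's a quick pass at a possible refactoring (untested):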
```swift
let passDescriptor = MTLRenderPassDescriptor()

// call this once before your render loop
func setup() {
    makeDepthTexture()
    passDescriptor.colorAttachments[0].clearColor = MTLClearColorMake(0.2, 0.2, 0.2, 1)
    passDescriptor.colorAttachments[0].storeAction = .Store
    passDescriptor.colorAttachments[0].loadAction = .Clear
    passDescriptor.depthAttachment.texture = depthTexture
    passDescriptor.depthAttachment.clearDepth = 1
    passDescriptor.depthAttachment.loadAction = .Clear
    passDescriptor.depthAttachment.storeAction = .Store
    // set up render pipeline state and depthStencil state
}

func drawNodes(nodes: [OctreeNode], inView view: AHMetalView) {
    // Wait for a free uniform slot *before* writing uniforms, so the CPU
    // never overwrites data the GPU is still reading for an earlier frame
    dispatch_semaphore_wait(displaySemaphore, DISPATCH_TIME_FOREVER)
    guard let drawable = layer.nextDrawable() else {
        // Give the slot back, otherwise later frames block forever
        dispatch_semaphore_signal(displaySemaphore)
        return
    }
    updateUniformsForView(view, duration: view.frameDuration)

    let commandBuffer = commandQueue.commandBuffer()
    commandBuffer.addCompletedHandler { _ in // unused parameter
        dispatch_semaphore_signal(self.displaySemaphore)
    }

    // Set up the one part of the pass descriptor that changes per-frame
    passDescriptor.colorAttachments[0].texture = drawable.texture

    // Create one render command encoder and reuse it for every node
    let renderPass = commandBuffer.renderCommandEncoderWithDescriptor(passDescriptor)

    // State shared by every node is set once, outside the loop
    renderPass.setRenderPipelineState(renderPipelineState)
    renderPass.setDepthStencilState(depthStencilState)
    renderPass.setFrontFacingWinding(.CounterClockwise)
    renderPass.setCullMode(.Back)
    renderPass.setTriangleFillMode(.Lines)
    let uniformBufferOffset = sizeof(AHUniforms) * uniformBufferIndex
    renderPass.setVertexBuffer(uniformBuffer, offset: uniformBufferOffset, atIndex: 1)

    for node in nodes {
        // Only the per-node buffers change between draws
        renderPass.setVertexBuffer(node.vertexBuffer, offset: 0, atIndex: 0)
        renderPass.drawIndexedPrimitives(.Triangle, indexCount: AHMaxIndicesPerChunk, indexType: AHIndexType, indexBuffer: node.indexBuffer, indexBufferOffset: 0)
    }
    renderPass.endEncoding()

    commandBuffer.presentDrawable(drawable)
    commandBuffer.commit()

    // Advance the uniform slot on the CPU right after submission, so the next
    // frame writes a different slot even while this one is still in flight
    uniformBufferIndex = (uniformBufferIndex + 1) % AHInFlightBufferCount
}
```
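Then, use Instruments to find any further performance issues you might have. There's a great WWDC 2015 session that shows several common "gotchas", how to diagnose them while profiling, and how to fix them.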
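One detail worth checking in the uniform handling: the offset is computed as `sizeof(AHUniforms) * uniformBufferIndex`. Apple's documentation for `setVertexBuffer` says that on macOS, buffers read through the `constant` address space must be bound at 256-byte-aligned offsets, so unless the uniform struct's size happens to be a multiple of 256, later slots may land on invalid offsets. The shader isn't shown, so whether this applies here is an assumption. A minimal sketch of an aligned ring of uniform slots, in modern Swift, with `UniformRing` and `alignedStride` as hypothetical names:

```swift
/// Rounds `size` up to a multiple of `alignment` (which must be a power of two).
func alignedStride(size: Int, alignment: Int) -> Int {
    precondition(alignment > 0 && alignment & (alignment - 1) == 0,
                 "alignment must be a power of two")
    return (size + alignment - 1) & ~(alignment - 1)
}

/// Hypothetical helper tracking which slot of an N-slot uniform buffer the CPU writes next.
struct UniformRing {
    let stride: Int      // bytes between slots, padded to the alignment
    let slotCount: Int   // e.g. the number of frames allowed in flight
    private(set) var index = 0

    init(uniformSize: Int, slotCount: Int, alignment: Int = 256) {
        self.stride = alignedStride(size: uniformSize, alignment: alignment)
        self.slotCount = slotCount
    }

    /// Byte offset of the slot the current frame should write and bind.
    var currentOffset: Int { stride * index }

    /// Total bytes the backing buffer needs for all slots.
    var bufferLength: Int { stride * slotCount }

    /// Call once per frame on the CPU, after committing the command buffer.
    mutating func advance() { index = (index + 1) % slotCount }
}
```

In current Swift you would pass `MemoryLayout<AHUniforms>.stride` as `uniformSize` (the replacement for `sizeof`); the padding wastes a little memory per slot but keeps every bound offset valid.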