Time: 2017-12-02 04:37:31

Tags: swift cocoa xcode9 metal metalkit

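I am designing a Cocoa application for macOS 10.13 using the Swift 4.0 MetalKit API. Everything I report here was done on my 2015 MacBook Pro.

I have successfully implemented an MTKView that renders simple geometry with a low vertex count (cubes, triangles, etc.) without trouble. I implemented a mouse-drag-based camera that can rotate, pan, and zoom. Here is a screenshot of the Xcode FPS debug gauge while rotating the cube: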

[Screenshot: Xcode FPS debug gauge while rotating the cube]

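However, when I try to load a dataset containing only about 1,500 vertices (each stored as 7 x 32-bit Floats, i.e. 42 kB in total), the FPS drops very badly. I show the implementation further down. Here is a screenshot (note that in this image the view contains only a few vertices, which are rendered as large points):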

[Screenshot: Xcode FPS debug gauge with the dataset loaded]

Here is my implementation:

1) viewDidLoad():

override func viewDidLoad() {

    super.viewDidLoad()

    // Initialization of the projection matrix and camera
    self.projectionMatrix = float4x4.makePerspectiveViewAngle(float4x4.degrees(toRad: 85.0),
                                      aspectRatio: Float(self.view.bounds.size.width / self.view.bounds.size.height),
                                      nearZ: 0.01, farZ: 100.0)
    self.vCam = ViewCamera()

    // Initialization of the MTLDevice
    metalView.device = MTLCreateSystemDefaultDevice()
    device = metalView.device
    metalView.colorPixelFormat = .bgra8Unorm

    // Initialization of the shader library
    let defaultLibrary = device.makeDefaultLibrary()!
    let fragmentProgram = defaultLibrary.makeFunction(name: "basic_fragment")
    let vertexProgram = defaultLibrary.makeFunction(name: "basic_vertex")

    // Initialization of the MTLRenderPipelineState
    let pipelineStateDescriptor = MTLRenderPipelineDescriptor()
    pipelineStateDescriptor.vertexFunction = vertexProgram
    pipelineStateDescriptor.fragmentFunction = fragmentProgram
    pipelineStateDescriptor.colorAttachments[0].pixelFormat = .bgra8Unorm
    pipelineState = try! device.makeRenderPipelineState(descriptor: pipelineStateDescriptor)

    // Initialization of the MTLCommandQueue
    commandQueue = device.makeCommandQueue()

    // Initialization of Delegates and BufferProvider for View and Projection matrix MTLBuffer
    self.metalView.delegate = self
    self.metalView.eventDelegate = self
    self.bufferProvider = BufferProvider(device: device, inflightBuffersCount: 3, sizeOfUniformsBuffer: MemoryLayout<Float>.size * float4x4.numberOfElements() * 2)
}

2) The MTLBuffer holding the cube vertices:

private func makeCubeVertexBuffer() {

    let cube = Cube()
    let vertices = cube.verticesArray
    var vertexData = Array<Float>()
    for vertex in vertices{
        vertexData += vertex.floatBuffer()
    }
    VDataSize = vertexData.count * MemoryLayout.size(ofValue: vertexData[0])
    self.vertexBuffer = device.makeBuffer(bytes: vertexData, length: VDataSize!, options: [])!
    self.vertexCount = vertices.count
}

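3) The MTLBuffer holding the dataset vertices. Note that I explicitly declare this buffer's storage mode as Private to ensure the GPU can access the data efficiently, since the CPU does not need to access the data once the buffer is loaded. Also note that I load only 1/100 of the vertices in the actual dataset, because the entire OS on my machine starts to lag when I try to load it in full (only 4.2 MB of data).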

public func loadDataset(datasetVolume: DatasetVolume) {

    // Load dataset vertices
    self.datasetVolume = datasetVolume
    self.datasetVertexCount = self.datasetVolume!.vertexCount/100
    let rgbaVertices = self.datasetVolume!.rgbaPixelVolume[0...(self.datasetVertexCount!-1)]
    var vertexData = Array<Float>()
    for vertex in rgbaVertices{
            vertexData += vertex.floatBuffer()
    }
    let dataSize = vertexData.count * MemoryLayout.size(ofValue: vertexData[0])

    // Make two MTLBuffer's: One with Shared storage mode in which data is initially loaded, and a second one with Private storage mode
    self.datasetVertexBuffer = device.makeBuffer(bytes: vertexData, length: dataSize, options: MTLResourceOptions.storageModeShared)
    self.datasetVertexBufferGPU = device.makeBuffer(length: dataSize, options: MTLResourceOptions.storageModePrivate)

    // Create a MTLCommandBuffer and blit the vertex data from the Shared MTLBuffer to the Private MTLBuffer
    let commandBuffer = self.commandQueue.makeCommandBuffer()
    let blitEncoder = commandBuffer!.makeBlitCommandEncoder()
    blitEncoder!.copy(from: self.datasetVertexBuffer!, sourceOffset: 0, to: self.datasetVertexBufferGPU!, destinationOffset: 0, size: dataSize)
    blitEncoder!.endEncoding()
    commandBuffer!.commit()

    // Clean up
    self.datasetLoaded = true
    self.datasetVertexBuffer = nil
}

4) Finally, here is the render loop. Again, this uses MetalKit.

func draw(in view: MTKView) {
    render(view.currentDrawable)
}

private func render(_ drawable: CAMetalDrawable?) {
    guard let drawable = drawable else { return }

    // Make sure an MTLBuffer for the View and Projection matrices is available
    _ = self.bufferProvider?.availableResourcesSemaphore.wait(timeout: DispatchTime.distantFuture)

    // Initialize common RenderPassDescriptor
    let renderPassDescriptor = MTLRenderPassDescriptor()
    renderPassDescriptor.colorAttachments[0].texture = drawable.texture
    renderPassDescriptor.colorAttachments[0].loadAction = .clear
    renderPassDescriptor.colorAttachments[0].clearColor = Colors.White
    renderPassDescriptor.colorAttachments[0].storeAction = .store

    // Initialize a CommandBuffer and add a CompletedHandler to release an MTLBuffer from the BufferProvider once the GPU is done processing this command
    let commandBuffer = self.commandQueue.makeCommandBuffer()
    commandBuffer?.addCompletedHandler { (_) in
        self.bufferProvider?.availableResourcesSemaphore.signal()
    }

    // Update the View matrix and obtain an MTLBuffer for it and the projection matrix
    let camViewMatrix = self.vCam.getLookAtMatrix()
    let uniformBuffer = bufferProvider?.nextUniformsBuffer(projectionMatrix: projectionMatrix, camViewMatrix: camViewMatrix)

    // Initialize a MTLParallelRenderCommandEncoder
    let parallelEncoder = commandBuffer?.makeParallelRenderCommandEncoder(descriptor: renderPassDescriptor)

    // Create a CommandEncoder for the cube vertices if its data is loaded
    if self.cubeLoaded == true {
        let cubeRenderEncoder = parallelEncoder?.makeRenderCommandEncoder()
        cubeRenderEncoder!.setCullMode(MTLCullMode.front)
        cubeRenderEncoder!.setRenderPipelineState(pipelineState)
        cubeRenderEncoder!.setTriangleFillMode(MTLTriangleFillMode.fill)
        cubeRenderEncoder!.setVertexBuffer(self.cubeVertexBuffer, offset: 0, index: 0)
        cubeRenderEncoder!.setVertexBuffer(uniformBuffer, offset: 0, index: 1)
        cubeRenderEncoder!.drawPrimitives(type: .triangle, vertexStart: 0, vertexCount: vertexCount!, instanceCount: self.cubeVertexCount!/3)
        cubeRenderEncoder!.endEncoding()
    }

    // Create a CommandEncoder for the dataset vertices if its data is loaded
    if self.datasetLoaded == true {
        let rgbaVolumeRenderEncoder = parallelEncoder?.makeRenderCommandEncoder()
        rgbaVolumeRenderEncoder!.setRenderPipelineState(pipelineState)
        rgbaVolumeRenderEncoder!.setVertexBuffer( self.datasetVertexBufferGPU!, offset: 0, index: 0)
        rgbaVolumeRenderEncoder!.setVertexBuffer(uniformBuffer, offset: 0, index: 1)
        rgbaVolumeRenderEncoder!.drawPrimitives(type: .point, vertexStart: 0, vertexCount: datasetVertexCount!, instanceCount: datasetVertexCount!)
        rgbaVolumeRenderEncoder!.endEncoding()
    }

    // End CommandBuffer encoding and commit task
    parallelEncoder!.endEncoding()
    commandBuffer!.present(drawable)
    commandBuffer!.commit()
}

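OK, so these are the steps I took to try to figure out what is causing the lag, keeping in mind that the lag is proportional to the size of the dataset vertex buffer:

  1. I initially thought the GPU could not access the memory fast enough because it was in Shared storage mode -> I changed the dataset MTLBuffer to Private storage mode. This did not solve the problem.

  2. I then thought the problem was that the CPU spends too much time in my render() function. This could be due to a problem with the BufferProvider, or because the CPU is somehow trying to reprocess/reload the dataset vertex buffer every frame -> To check this, I used the Time Profiler in Xcode's Instruments. Unfortunately, the problem seems to be that the application only rarely calls this render method (in other words, the MTKView's draw() method). Here are some screenshots: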

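[Screenshot: Time Profiler trace in Instruments]

    • The spike at ~10 seconds is when the cube is loaded
    • The spikes between ~25 and 35 seconds are when the dataset is loaded

[Screenshot: Time Profiler call tree after loading the cube]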

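    • This image (^) shows the activity between ~10 and 20 seconds, right after the cube is loaded. This is when the FPS is at ~60. You can see that the main thread spends about 53 ms in the render() function during these 10 seconds.

[Screenshot: Time Profiler call tree after loading the dataset]

    • This image (^) shows the activity between ~40 and 50 seconds, right after the dataset is loaded. This is when the FPS is below 10. You can see that the main thread spends about 4 ms in the render() function during these 10 seconds. As you can see, none of the methods normally called from within this function are called (i.e. the ones we can see being called when only the cube is loaded, in the previous image). It is also worth noting that when I load the dataset, the Time Profiler's timer starts to jump (i.e. it stops for a few seconds, then jumps to the current time... and repeats).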

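So this is where I am. The problem seems to be that the CPU is somehow being overloaded, in a recursive fashion, by these 42 kB of data. I also ran a test with the Allocations instrument in Xcode's Instruments. As far as I can tell there is no sign of a memory leak (as you may have noticed, a lot of this is new to me).

Sorry for the convoluted post; I hope it is not too hard to follow. Thank you all in advance for your help.

Edit

Here are my shaders, in case you would like to see them: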

    struct VertexIn{
        packed_float3 position;
        packed_float4 color;
    };
    
    struct VertexOut{
        float4 position [[position]];  
        float4 color;
        float  size [[point_size]];
    };
    
    struct Uniforms{
        float4x4 cameraMatrix;
        float4x4 projectionMatrix;
    };
    
    
    vertex VertexOut basic_vertex(const device VertexIn* vertex_array [[ buffer(0) ]],
                                  constant Uniforms&  uniforms    [[ buffer(1) ]],
                                  unsigned int vid [[ vertex_id ]]) {
    
        float4x4 cam_Matrix = uniforms.cameraMatrix;
        float4x4 proj_Matrix = uniforms.projectionMatrix;
    
        VertexIn VertexIn = vertex_array[vid];
    
        VertexOut VertexOut;
        VertexOut.position = proj_Matrix * cam_Matrix * float4(VertexIn.position,1);
        VertexOut.color = VertexIn.color;
        VertexOut.size = 15;
    
        return VertexOut;
    }
    
    fragment half4 basic_fragment(VertexOut interpolated [[stage_in]]) {
        return half4(interpolated.color[0], interpolated.color[1], interpolated.color[2], interpolated.color[3]);
    }
    

1 answer:

Answer 0 (score: 2)

I think the main problem is that you're telling Metal to do instanced drawing when you shouldn't be. This line:

rgbaVolumeRenderEncoder!.drawPrimitives(type: .point, vertexStart: 0, vertexCount: datasetVertexCount!, instanceCount: datasetVertexCount!)

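tells Metal to draw datasetVertexCount! instances of datasetVertexCount! vertices each. The GPU workload grows with the square of the vertex count. Also, since you don't use the instance ID to, for example, tweak the vertex position, all of these instances are identical and therefore redundant.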
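To put rough numbers on that quadratic growth, here is a small sketch in plain Swift (no Metal required). The helper `vertexInvocations` is hypothetical and only multiplies the two draw-call arguments; the vertex counts are the approximate figures from the question (about 1,500 loaded, about 150,000 in the full 4.2 MB dataset).

```swift
// Hypothetical helper: number of vertex-shader invocations one draw call issues.
// Each of `instanceCount` instances processes all `vertexCount` vertices.
func vertexInvocations(vertexCount: Int, instanceCount: Int = 1) -> Int {
    return vertexCount * instanceCount
}

let reducedCount = 1_500    // roughly 1/100 of the dataset, as loaded in the question
let fullCount = 150_000     // assumed full dataset: 4.2 MB / 28 bytes per vertex

// Instanced as in the question: instanceCount == vertexCount
let instancedReduced = vertexInvocations(vertexCount: reducedCount, instanceCount: reducedCount)
let instancedFull = vertexInvocations(vertexCount: fullCount, instanceCount: fullCount)

// Non-instanced: every vertex is processed once
let plainReduced = vertexInvocations(vertexCount: reducedCount)

print(instancedReduced)  // 2250000
print(instancedFull)     // 22500000000
print(plainReduced)      // 1500
```

Under these assumptions the reduced dataset already costs over two million vertex invocations per frame instead of 1,500, which would be consistent with the whole system lagging when the full dataset is loaded.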

I think the same applies to this line:

cubeRenderEncoder!.drawPrimitives(type: .triangle, vertexStart: 0, vertexCount: vertexCount!, instanceCount: self.cubeVertexCount!/3)

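although it's not clear what self.cubeVertexCount! is and whether it grows with vertexCount. In any case, since you seem to be using the same pipeline state, and therefore the same shaders that make no use of the instance ID, it is still useless and wasteful.

Other things:

Why are you using MTLParallelRenderCommandEncoder when you're not actually taking advantage of the parallelism it enables? Don't do that.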
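As a rough sketch of what that could look like, the function below encodes both draws through one ordinary MTLRenderCommandEncoder and uses the non-instanced drawPrimitives(type:vertexStart:vertexCount:) call. The function name `encodeScene` and its parameters are hypothetical stand-ins for the properties used in the question's render() method; this is one possible arrangement, not the only one.

```swift
import Metal

// Hypothetical helper (not from the original post): encodes the cube and the
// dataset with a single render command encoder and non-instanced draw calls.
func encodeScene(commandBuffer: MTLCommandBuffer,
                 descriptor: MTLRenderPassDescriptor,
                 pipelineState: MTLRenderPipelineState,
                 uniformBuffer: MTLBuffer,
                 cube: (buffer: MTLBuffer, vertexCount: Int)?,
                 dataset: (buffer: MTLBuffer, vertexCount: Int)?) {
    guard let encoder = commandBuffer.makeRenderCommandEncoder(descriptor: descriptor) else {
        return
    }
    // State shared by both draws is set once.
    encoder.setRenderPipelineState(pipelineState)
    encoder.setVertexBuffer(uniformBuffer, offset: 0, index: 1)

    if let cube = cube {
        encoder.setCullMode(.front)
        encoder.setVertexBuffer(cube.buffer, offset: 0, index: 0)
        // One instance: each cube vertex is processed exactly once.
        encoder.drawPrimitives(type: .triangle, vertexStart: 0, vertexCount: cube.vertexCount)
    }

    if let dataset = dataset {
        encoder.setCullMode(.none)
        encoder.setVertexBuffer(dataset.buffer, offset: 0, index: 0)
        // One instance: each dataset point is processed exactly once.
        encoder.drawPrimitives(type: .point, vertexStart: 0, vertexCount: dataset.vertexCount)
    }

    encoder.endEncoding()
}
```

The caller would still create the command buffer, present the drawable, and commit, as the question's render() method already does.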

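Everywhere you use the size method of MemoryLayout, you should almost certainly be using stride instead. And if you're computing the stride of a compound data structure, do not take the stride of one element of that structure and multiply by the number of elements. Take the stride of the whole data structure.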
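A minimal illustration of the difference, using a hypothetical `Sample` struct that is not part of the question's code. For a plain Float, size and stride are both 4 bytes, so the buffer lengths computed in the question happen to come out the same either way; the distinction starts to matter once a type carries trailing padding.

```swift
// Hypothetical struct for illustration only; not the vertex layout from the question.
struct Sample {
    var value: Double   // 8 bytes, 8-byte alignment
    var flag: UInt8     // 1 byte, followed by 7 bytes of padding in an array
}

let sampleSize = MemoryLayout<Sample>.size       // bytes actually occupied by the fields
let sampleStride = MemoryLayout<Sample>.stride   // distance between consecutive array elements
let summedMembers = MemoryLayout<Double>.stride + MemoryLayout<UInt8>.stride

print(sampleSize)     // 9
print(sampleStride)   // 16
print(summedMembers)  // 9, which under-counts the per-element spacing

// Length to allocate for a buffer of `count` elements: stride of the whole struct.
let count = 1_500
let byteLength = MemoryLayout<Sample>.stride * count
print(byteLength)     // 24000
```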