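I'm trying to implement multithreaded command buffer generation (using a per-thread command pool and secondary command buffers), but using multiple threads yields almost no performance gain.
First, I thought my thread pool code was written incorrectly, but I tried Sascha Willems's thread pool implementation and nothing changed (so I don't think that is the problem).
Second, I searched for multithreading performance issues and found that accessing the same variables/resources from different threads can degrade performance, but I still couldn't solve the problem.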
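For readers unfamiliar with the effect mentioned above: one common form of it is false sharing, where per-thread variables that sit in the same cache line slow each other down even though no data is logically shared. The following is only a minimal sketch of the usual mitigation, not code from my project; `PaddedCounter`, `countInParallel`, and the 64-byte cache-line size are assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Assumed cache-line size; 64 bytes is typical on x86-64 but not guaranteed.
constexpr std::size_t kCacheLine = 64;

// Each counter is aligned and padded to a full cache line, so writes from one
// thread do not invalidate the line that holds another thread's counter.
struct alignas(kCacheLine) PaddedCounter {
    std::uint64_t value = 0;
};

// Hypothetical workload: every thread increments only its own counter.
std::uint64_t countInParallel(unsigned threadCount, std::uint64_t itemsPerThread) {
    std::vector<PaddedCounter> counters(threadCount);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threadCount; ++t) {
        workers.emplace_back([&counters, t, itemsPerThread] {
            for (std::uint64_t i = 0; i < itemsPerThread; ++i)
                ++counters[t].value;
        });
    }
    for (auto& w : workers) w.join();

    // Sum the per-thread results on the calling thread after all joins.
    std::uint64_t total = 0;
    for (const auto& c : counters) total += c.value;
    return total;
}
```

Whether padding like this changes anything depends on how the per-thread data is laid out, so it is something to measure rather than assume.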
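I also downloaded Sascha Willems's multithreading code and ran it, and it works well. I changed the number of worker threads, and using more threads clearly improves performance.
Below are some FPS results for rendering 600 objects (the same model). You can see what my problem is: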
Core count   Sascha Willems's result   My result    My result, validation layer disabled
             (avg. FPS)                (avg. FPS)   (avg. FPS)
1            45                        30           55
2            83                        33           72
4            110                       40           84
6            155                       42           103
8            162                       42           104
10           173                       40           111
12           175                       40           119
This is where I prepare the thread data:
void prepareThreadData()
{
    // Primary command buffer and its pool, used only by the main thread.
    primaryCommandPool = m_device.createCommandPool(
        vk::CommandPoolCreateInfo(
            vk::CommandPoolCreateFlags(vk::CommandPoolCreateFlagBits::eResetCommandBuffer),
            graphicsQueueIdx
        )
    );
    primaryCommandBuffer = m_device.allocateCommandBuffers(
        vk::CommandBufferAllocateInfo(
            primaryCommandPool,
            vk::CommandBufferLevel::ePrimary,
            1
        )
    )[0];

    threadData.resize(numberOfThreads);
    for (int i = 0; i < numberOfThreads; ++i)
    {
        // One command pool per thread, so no pool is shared between threads.
        threadData[i].commandPool = m_device.createCommandPool(
            vk::CommandPoolCreateInfo(
                vk::CommandPoolCreateFlags(vk::CommandPoolCreateFlagBits::eResetCommandBuffer),
                graphicsQueueIdx
            )
        );
        // One secondary command buffer per object handled by this thread.
        threadData[i].commandBuffer = m_device.allocateCommandBuffers(
            vk::CommandBufferAllocateInfo(
                threadData[i].commandPool,
                vk::CommandBufferLevel::eSecondary,
                numberOfObjectsPerThread
            )
        );
        for (int j = 0; j < numberOfObjectsPerThread; ++j)
        {
            VertexPushConstant pushConstant = { someRandomPosition() };
            threadData[i].pushConstBlock.push_back(pushConstant);
        }
    }
}
This is my render loop code, where I assign work to each thread:
while (!display.IsWindowClosed())
{
    display.PollEvents();
    m_device.acquireNextImageKHR(m_swapChain, std::numeric_limits<uint64_t>::max(), presentCompleteSemaphore, nullptr, &currentBuffer);

    primaryCommandBuffer.begin(vk::CommandBufferBeginInfo());
    primaryCommandBuffer.beginRenderPass(
        vk::RenderPassBeginInfo(m_renderPass, m_swapChainBuffers[currentBuffer].frameBuffer, m_renderArea, clearValues.size(), clearValues.data()),
        vk::SubpassContents::eSecondaryCommandBuffers);

    // Secondary command buffers inherit the render pass and framebuffer.
    vk::CommandBufferInheritanceInfo inheritanceInfo = {};
    inheritanceInfo.renderPass = m_renderPass;
    inheritanceInfo.framebuffer = m_swapChainBuffers[currentBuffer].frameBuffer;

    // Queue one job per object; each job records one secondary command buffer.
    for (int t = 0; t < numberOfThreads; ++t)
    {
        for (int i = 0; i < numberOfObjectsPerThread; ++i)
        {
            threadPool.threads[t]->addJob([=]
            {
                std::array<vk::DeviceSize, 1> offsets = { 0 };
                vk::Viewport viewport = vk::Viewport(0.0f, 0.0f, WIDTH, HEIGHT, 0.0f, 1.0f);
                vk::Rect2D renderArea = vk::Rect2D(vk::Offset2D(), vk::Extent2D(WIDTH, HEIGHT));
                threadData[t].commandBuffer[i].begin(vk::CommandBufferBeginInfo(vk::CommandBufferUsageFlagBits::eRenderPassContinue, &inheritanceInfo));
                threadData[t].commandBuffer[i].setViewport(0, viewport);
                threadData[t].commandBuffer[i].setScissor(0, renderArea);
                threadData[t].commandBuffer[i].bindPipeline(vk::PipelineBindPoint::eGraphics, m_graphicsPipeline);
                threadData[t].commandBuffer[i].bindVertexBuffers(VERTEX_BUFFER_BIND, 1, &model.vertexBuffer, offsets.data());
                threadData[t].commandBuffer[i].bindIndexBuffer(model.indexBuffer, 0, vk::IndexType::eUint32);
                threadData[t].commandBuffer[i].pushConstants(pipelineLayout, vk::ShaderStageFlagBits::eVertex, 0, sizeof(VertexPushConstant), &threadData[t].pushConstBlock[i]);
                threadData[t].commandBuffer[i].drawIndexed(model.indexCount, 1, 0, 0, 0);
                threadData[t].commandBuffer[i].end();
            });
        }
    }
    // Block until every worker has finished recording.
    threadPool.wait();

    // Gather all secondary command buffers and execute them from the primary one.
    std::vector<vk::CommandBuffer> commandBuffers;
    for (int t = 0; t < numberOfThreads; ++t)
    {
        for (int i = 0; i < numberOfObjectsPerThread; ++i)
        {
            commandBuffers.push_back(threadData[t].commandBuffer[i]);
        }
    }
    primaryCommandBuffer.executeCommands(commandBuffers.size(), commandBuffers.data());
    primaryCommandBuffer.endRenderPass();
    primaryCommandBuffer.end();
    submitQueue(presentCompleteSemaphore, primaryCommandBuffer);
}
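One variation that may be worth measuring against the loop above is to queue a single job per thread that records all of that thread's objects, instead of one job and one secondary command buffer per object. The sketch below only shows how the objects could be split into contiguous per-thread ranges; `splitIntoRanges` is a hypothetical helper, not part of my project or of Sascha Willems's samples.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Splits objectCount items into threadCount contiguous [begin, end) ranges
// whose sizes differ by at most one. Returns no ranges if threadCount is 0.
std::vector<std::pair<std::size_t, std::size_t>>
splitIntoRanges(std::size_t objectCount, std::size_t threadCount) {
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    if (threadCount == 0) return ranges;

    const std::size_t base = objectCount / threadCount;   // minimum per thread
    const std::size_t extra = objectCount % threadCount;  // first 'extra' threads get one more
    std::size_t begin = 0;
    for (std::size_t t = 0; t < threadCount; ++t) {
        const std::size_t count = base + (t < extra ? 1 : 0);
        ranges.emplace_back(begin, begin + count);
        begin += count;
    }
    return ranges;
}
```

With ranges like these, each worker could begin one secondary command buffer, set the viewport, scissor, pipeline, vertex buffer, and index buffer once, and then call only pushConstants and drawIndexed for every object in its range. Whether that is faster on a given driver is an assumption that needs to be verified by profiling.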
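If you have any idea what I'm missing or doing wrong, please let me know.
Here is the complete VS 2017 project, if anyone wants to play with it :D
I know it's a mess, but I'm just learning Vulkan.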
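Answer 0 (score: 1)
It seems I found the problem: I had not disabled the validation layers. Once I disabled them, performance improved a lot; I updated the fourth column of the table in the question for comparison. Who knew the validation layers would consume so much run time. If anyone wants to measure Vulkan performance, don't forget to disable them!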