Super huge bottleneck writing to a GPU compute buffer?

Time: 2018-02-02 23:49:32

Tags: c# unity3d shader hlsl

I have created a compute shader in Unity, but my code has a super huge bottleneck: I am basically losing 1000x performance.

I have put together some sample code to demonstrate the problem. What the code actually computes is irrelevant and does not make much sense.

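I am losing a massive amount of performance writing to the compute buffer, at the line cBuffer[id].vel += vel; in the shader code. With that line enabled I get about 40 fps in Unity with pCount = (1024 * 256); (~256k, set in the C# code), but if I disable that write in the shader I can go up to pCount = (1024 * 1024 * 64); (~64M) at more than 60 fps with no problem. I am guessing this happens because different threads try to write to the same memory and have to wait for the others to finish, but is there a smarter way to do this?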

Download Unity and Visual Studio project files (Unity 2017.3.0f3)

C# code:

using System.Collections;
using System.Collections.Generic;
using UnityEngine;

public class computeScript : MonoBehaviour
{
    public ComputeShader cShader;

    struct Particle
    {
        public Vector2 pos;
        public Vector2 vel;
    }

    ComputeBuffer cBuffer;
    const int pCount = (1024 * 256); // <--- set count
    Particle[] particles = new Particle[pCount];
    int kernelCSMain;

    void Start ()
    {
        kernelCSMain = cShader.FindKernel("CSMain");
        cShader.SetInt("pCount", pCount);

        cBuffer = new ComputeBuffer(pCount, (sizeof(float) * 4), ComputeBufferType.Default);

        for(int i = 0; i < pCount; i++)
        {
            particles[i] = new Particle();
            particles[i].pos = new Vector2();
            particles[i].vel = new Vector2();
        }
        cBuffer.SetData(particles);
    }

    void Update ()
    {
        cShader.SetBuffer(kernelCSMain, "cBuffer", cBuffer);
        cShader.Dispatch(kernelCSMain, pCount / 1024, 1, 1);
    }

    void OnDestroy()
    {
        cBuffer.Release();
    }
}

Compute shader code:

#pragma kernel CSMain

struct Particle
{
    float2 pos;
    float2 vel;
};

RWStructuredBuffer<Particle> cBuffer;
int pCount;

[numthreads(1024,1,1)]
void CSMain (uint id : SV_DispatchThreadID)
{
    float2 vel = float2(0, 0); // initialize the accumulator before summing
    for (int i = 0; i < pCount; i++) 
    {
        vel += (cBuffer[id].pos + cBuffer[i].pos);
    }
    cBuffer[id].vel += vel; // <---- this line is the issue
}

1 Answer:

Answer 0 (score: 1)

The problem does not come from the write; it comes from dead code elimination.

If I take your code without the write:

[numthreads(1024,1,1)]
void CSMain (uint id : SV_DispatchThreadID)
{
    float2 vel = float2(0, 0); // initialize the accumulator before summing
    for (int i = 0; i < pCount; i++) 
    {
        vel += (cBuffer[id].pos + cBuffer[i].pos);
    }
}

The compiler detects that vel is not used anywhere (since it is never written out), so it removes the code that assigns to it. The line:

vel += (cBuffer[id].pos + cBuffer[i].pos);

is then removed (because vel is unused). The compiler next detects that the loop body is empty, so it gets rid of the loop as well.

So in your case, commenting out that line leaves you with an empty shader that does nothing.

To demonstrate, here is the result of compiling the shader with fxc:

fxc cs.fx /O3 /Tcs_5_0 /ECSMain

First, with the write enabled:

cs_5_0
dcl_globalFlags refactoringAllowed
dcl_constantbuffer CB0[1], immediateIndexed
dcl_uav_structured u0, 16
dcl_input vThreadID.x
dcl_temps 2
dcl_thread_group 1024, 1, 1
ld_structured_indexable(structured_buffer, stride=16)(mixed,mixed,mixed,mixed) r0.xy, vThreadID.x, l(0), u0.xyxx
mov r0.zw, l(0,0,0,0)
mov r1.x, l(0)
loop
  ige r1.y, r1.x, cb0[0].x
  breakc_nz r1.y
  ld_structured_indexable(structured_buffer, stride=16)(mixed,mixed,mixed,mixed) r1.yz, r1.x, l(0), u0.xxyx
  add r1.yz, r0.xxyx, r1.yyzy
  add r0.zw, r0.zzzw, r1.yyyz
  iadd r1.x, r1.x, l(1)
endloop
ld_structured_indexable(structured_buffer, stride=16)(mixed,mixed,mixed,mixed) r0.xy, vThreadID.x, l(8), u0.xyxx
add r0.xy, r0.zwzz, r0.xyxx
store_structured u0.xy, vThreadID.x, l(8), r0.xyxx
ret
// Approximately 15 instruction slots used

Now, if you comment out your write and run the same compile command:

cs_5_0
dcl_globalFlags refactoringAllowed
dcl_thread_group 1024, 1, 1
ret
// Approximately 1 instruction slots used
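
The same conclusion can be checked at runtime. The following is a minimal sketch, assuming a hypothetical second kernel named CSWriteOnly (not part of the original project) added to the same compute shader file. It keeps the per-thread read-modify-write on cBuffer[id].vel but drops the loop over all particles. Each thread only ever writes its own element, so there is no contention between threads on the write.

```hlsl
#pragma kernel CSWriteOnly // hypothetical kernel name, used only for this test

struct Particle
{
    float2 pos;
    float2 vel;
};

RWStructuredBuffer<Particle> cBuffer;

[numthreads(1024,1,1)]
void CSWriteOnly (uint id : SV_DispatchThreadID)
{
    // Same kind of buffer write as in CSMain, but without the O(pCount) loop.
    // The write is an observable side effect, so the compiler cannot remove it.
    cBuffer[id].vel += cBuffer[id].pos;
}
```

To try it, look the kernel up with cShader.FindKernel("CSWriteOnly"), bind cBuffer with SetBuffer and dispatch it exactly like CSMain. If this kernel stays fast at particle counts where CSMain drops to 40 fps, the write is not the expensive part; the loop is.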

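Also, note that in your case you are running an n^2 algorithm in the compute shader: every particle is checked against every other particle, so for 262144 particles you are performing 68719476736 "iterations" (which explains the severe performance loss once the write is enabled).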
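
To put the two particle counts from the question side by side, here is the arithmetic, assuming one loop iteration per particle pair per dispatch:

```latex
\begin{align*}
N &= 1024 \times 256 = 2^{18} = 262144, & N^2 &= 2^{36} = 68\,719\,476\,736 \\
N &= 1024 \times 1024 \times 64 = 2^{26} = 67\,108\,864, & N^2 &= 2^{52} \approx 4.5 \times 10^{15}
\end{align*}
```

At 60 fps the second case would need roughly 2.7 × 10^17 loop iterations per second, far more than a GPU can execute. That is consistent with the loop having been eliminated rather than run when the write was commented out.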