Uninitialized local struct in C++ causes a performance drop?

Question

I'm working on a 3D software renderer. In my code, I've declared a struct Arti3DVSOutput that has no explicit default constructor. It looks like this:

struct Arti3DVSOutput {
    vec4    vPosition;   // vec4 has a default ctor that sets all 4 floats to 0.0f.
    float   Varyings[g_ciMaxVaryingNum];
};

void Arti3DDevice::GetTransformedVertex(uint32_t i_iVertexIndex, Arti3DTransformedVertex *out)
{
    // Try to fetch result from cache.
    uint32_t iCacheIndex = i_iVertexIndex&(g_ciCacheSize - 1);
    if (vCache[iCacheIndex].tag == i_iVertexIndex)
    {
        *out = *(vCache[iCacheIndex].v);
    }
    else
    {
        // Cache miss. Need calculation.

        Arti3DVSInput vsinput;

        // (code that fills in "vsinput" omitted)

        Arti3DVSOutput vs_output;
        // Whether the following line is commented out makes a big difference.
        //memset(&vs_output, 0, sizeof(Arti3DVSOutput));

        mRC.pfnVS(&vsinput, &mRC.globals, &vs_output);
        *out = vs_output;

        // Store results in cache.
        vCache[iCacheIndex].tag = i_iVertexIndex;
        vCache[iCacheIndex].v = &tvBuffer[i_iVertexIndex];
        tvBuffer[i_iVertexIndex] = vs_output;
    }
}

mRC.pfnVS is a function pointer. The function it points to is implemented as follows:

void NewCubeVS(Arti3DVSInput *i_pVSInput, Arti3DShaderUniform* i_pUniform, Arti3DVSOutput *o_pVSOutput)
{
    o_pVSOutput->vPosition = i_pUniform->mvp * i_pVSInput->ShaderInputs[0];
    o_pVSOutput->Varyings[0] = i_pVSInput->ShaderInputs[1].x;
    o_pVSOutput->Varyings[1] = i_pVSInput->ShaderInputs[1].y;
    o_pVSOutput->Varyings[2] = i_pVSInput->ShaderInputs[1].z;
}

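As you can see, all I do in this function is fill in some members of "o_pVSOutput". No read operation is performed. Here comes the problem: the renderer's performance drops sharply, from 400+ fps to 60+ fps, when the local variable "vs_output" is not set to 0 before its address is passed to the function ("NewCubeVS" in this case) as the third parameter.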

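The rendered images are exactly the same. When I turn optimization off (-O0), the two versions perform the same. As soon as I turn optimization on (-O1, -O2, or -O3), the performance difference shows up again.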

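I profiled the program and found something very strange. The extra time cost of the "vs_output uninitialized" version does not occur in the function "GetTransformedVertex", or even near it. The increase comes from some SSE intrinsics that run well after "GetTransformedVertex" is called. I'm really confused...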

FYI, I'm using Visual Studio 2013 Community.

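Now I know that this performance drop is caused by the uninitialized struct, but I don't know how. Does it implicitly turn off some of the compiler's optimizations?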

If necessary, I will post my source code to my github for your reference.

Any opinion is appreciated! Thank you in advance.

Update: Inspired by @KerrekSB, I ran some more tests. Calling memset() with different values gives very different performance!

1: memset() to 0: 400+ fps.

2: memset() to 1, 2, 3, 4, 5, ...: 40~60 fps.
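To make the memset() results easier to reason about, here is a minimal sketch of what a float looks like after every one of its bytes is filled with the same value. The helper names (FloatFromByteFill, Scale, IsSubnormal) are hypothetical and not part of the renderer. Under IEEE 754 single precision, a fill of 0 gives 0.0f, a fill of 1 gives a tiny value near the bottom of the normal range (so scaling it down lands in the subnormal range), and a fill of 0xFF gives NaN. Whether such values are what slows the later SSE code is an assumption here, not something the measurements above prove.

```cpp
#include <cmath>
#include <cstring>

// Build a float whose four bytes all hold `fill`, exactly as memset() would leave it.
float FloatFromByteFill(unsigned char fill)
{
    float f;
    std::memset(&f, fill, sizeof f);
    return f;
}

// Multiply through a volatile so the compiler cannot fold the product at compile time.
float Scale(float value, float factor)
{
    volatile float v = value;
    return v * factor;
}

// True when the value is a subnormal (denormal) float.
bool IsSubnormal(float f)
{
    return std::fpclassify(f) == FP_SUBNORMAL;
}
```

For example, IsSubnormal(Scale(FloatFromByteFill(1), 0.25f)) is expected to hold on a build that does not enable flush-to-zero or fast-math options.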

Then I removed memset() and explicitly implemented a ctor for Arti3DVSOutput. The ctor does nothing but set all the floats in Varyings[] to a valid float value (e.g. 0.0f, 1.5f, 100.0f, ...). Aha, 400+ fps.
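A minimal sketch of such a ctor, for reference. The value 12 for g_ciMaxVaryingNum and the stand-in vec4 are assumptions made only so the snippet compiles on its own; the real definitions are not shown in this post. AllVaryingsZero is a hypothetical helper used only to check the result.

```cpp
#include <cstddef>

// Assumption: placeholder size, the real g_ciMaxVaryingNum is defined elsewhere.
const std::size_t g_ciMaxVaryingNum = 12;

// Stand-in for the renderer's vec4, whose default ctor zeroes all four floats.
struct vec4 {
    float x, y, z, w;
    vec4() : x(0.0f), y(0.0f), z(0.0f), w(0.0f) {}
};

struct Arti3DVSOutput {
    vec4  vPosition;
    float Varyings[g_ciMaxVaryingNum];

    // Value-initialize the array so every varying starts out as 0.0f.
    Arti3DVSOutput() : vPosition(), Varyings() {}
};

// Hypothetical helper: true when every varying holds exactly 0.0f.
bool AllVaryingsZero(const Arti3DVSOutput& output)
{
    for (std::size_t i = 0; i < g_ciMaxVaryingNum; ++i) {
        if (output.Varyings[i] != 0.0f) {
            return false;
        }
    }
    return true;
}
```

With this ctor in place, a plain "Arti3DVSOutput vs_output;" starts with all varyings at 0.0f, so no separate memset() call is needed.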

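So far, it seems that the values/contents of Arti3DVSOutput have a great impact on performance.

Then I ran some tests to find out which part of Arti3DVSOutput's memory really matters. Here is the code.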

Arti3DVSOutput vs_output;  // No explicit default ctor in this version.
// Comment out the following 12 lines of code one by one to find out which piece of uninitialized memory really matters.
vs_output.Varyings[0] = 0.0f;
vs_output.Varyings[1] = 0.0f;
vs_output.Varyings[2] = 0.0f;
vs_output.Varyings[3] = 0.0f;
vs_output.Varyings[4] = 0.0f;
vs_output.Varyings[5] = 0.0f;
vs_output.Varyings[6] = 0.0f;
vs_output.Varyings[7] = 0.0f;
vs_output.Varyings[8] = 0.0f;
vs_output.Varyings[9] = 0.0f;
vs_output.Varyings[10] = 0.0f;
vs_output.Varyings[11] = 0.0f;
mRC.pfnVS(&vsinput, &mRC.globals, &vs_output);

I commented out the 12 lines of code one at a time and ran the program each time.

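The results are as follows (the first column is the index of the Varyings[] element whose initialization line was commented out):
Commented-out index    FPS
0                      420
1                      420
2                      420
3                      420
4                      200
5                      420
6                      280
7                      195
8                      197
9                      200
10                     200
11                     420
0,1,2,3,5,11           420
4,6,7,8,9,10           60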

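It seems that elements 4, 6, 7, 8, 9, and 10 of Varyings[] each contribute to the performance drop.

I'm really confused about what the compiler is doing behind my back. Does the compiler have to do some kind of value checking?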

Solution:

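I got it! The root of the problem is that the uninitialized, or improperly initialized, floating-point values are later used as arguments to SSE intrinsics. Those invalid floats raise exceptions and slow the SSE intrinsics down greatly.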
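One way to catch this kind of problem earlier is a debug-only check that scans the varyings for suspicious values before they reach the SSE code. The following is a rough sketch, not code from the renderer: CountSuspectFloats is a hypothetical helper, and treating NaN, infinity, and subnormal values as "suspect" is my assumption about which values are the slow ones.

```cpp
#include <cmath>
#include <cstddef>

// Count values that are NaN, infinite, or subnormal: the kinds of floats
// that uninitialized memory can easily contain.
std::size_t CountSuspectFloats(const float* values, std::size_t count)
{
    std::size_t suspects = 0;
    for (std::size_t i = 0; i < count; ++i) {
        switch (std::fpclassify(values[i])) {
        case FP_NAN:
        case FP_INFINITE:
        case FP_SUBNORMAL:
            ++suspects;
            break;
        default:
            break;  // FP_ZERO and FP_NORMAL are considered fine.
        }
    }
    return suspects;
}
```

In a debug build it could be called right after mRC.pfnVS returns, for example as CountSuspectFloats(vs_output.Varyings, g_ciMaxVaryingNum), with a nonzero result reported as a warning.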
