Question

I have the following audio code, which I thought would be a good candidate for vDSP in the Accelerate framework.

// --- get pointers for buffer lists
float* left = (float*)audio->mBuffers[0].mData;
float* right = numChans == 2 ? (float*)audio->mBuffers[1].mData : NULL;

float dLeftAccum = 0.0;
float dRightAccum = 0.0;

float fMix = 0.25; // -12dB HR per note

// --- the frame processing loop
for(UInt32 frame=0; frame<inNumberFrames; ++frame)
{
    // --- zero out for each trip through loop
    dLeftAccum = 0.0;
    dRightAccum = 0.0;
    float dLeft = 0.0;
    float dRight = 0.0;

    // --- synthesize and accumulate each note's sample
    for(int i=0; i<MAX_VOICES; i++)
    {
        // --- render
        if(m_pVoiceArray[i]) 
            m_pVoiceArray[i]->doVoice(dLeft, dRight);

        // --- accumulate and scale
        dLeftAccum += fMix*(float)dLeft;
        dRightAccum += fMix*(float)dRight;

    }

    // --- accumulate in output buffers
    // --- mono
    left[frame] = (float)dLeftAccum;

    // --- stereo
    if(right) right[frame] = (float)dRightAccum;
}

// needed???
//  mAbsoluteSampleFrame += inNumberFrames;

return noErr;

So I modified it to use vDSP, multiplying by fMix once at the end of the block of frames.

// --- the frame processing loop
for(UInt32 frame=0; frame<inNumberFrames; ++frame)
{
    // --- zero out for each trip through loop
    dLeftAccum = 0.0;
    dRightAccum = 0.0;
    float dLeft = 0.0;
    float dRight = 0.0;

    // --- synthesize and accumulate each note's sample
    for(int i=0; i<MAX_VOICES; i++)
    {
        // --- render
        if(m_pVoiceArray[i]) 
            m_pVoiceArray[i]->doVoice(dLeft, dRight);

        // --- accumulate (scaling now happens after the frame loop)
        dLeftAccum += (float)dLeft;
        dRightAccum += (float)dRight;

    }

    // --- accumulate in output buffers
    // --- mono
    left[frame] = (float)dLeftAccum;

    // --- stereo
    if(right) right[frame] = (float)dRightAccum;
}
// --- scale the whole block once; right is NULL for mono
vDSP_vsmul(left, 1, &fMix, left, 1, inNumberFrames);
if(right) vDSP_vsmul(right, 1, &fMix, right, 1, inNumberFrames);
// needed???
//  mAbsoluteSampleFrame += inNumberFrames;

return noErr;

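However, my CPU usage stays exactly the same. I see no obvious benefit from using vDSP here. Am I doing this correctly? Many thanks.

I'm still new to vector operations, so go easy on me :)

If there are any obvious optimizations I should be making (outside of the Accelerate framework), feel free to point them out, thanks!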

Answer 1

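Your vector calls perform two multiplies per frame (one per channel) at the audio sample rate. Even at a sample rate of 192kHz, that is only 384,000 multiplies per second, which is not enough to register on a modern CPU. Moreover, you are just moving existing multiplies to another place. If you look at the generated assembly, I'd guess the compiler already optimised your original code fairly well, and any speed-up from the vDSP call will be negated by the fact that you now need a second loop over the buffer.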
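To put those numbers in perspective, here is a back-of-the-envelope sketch of how many gain multiplies per second each version of the code performs. The helper name and the voice count of 16 are assumptions for illustration only; the question does not say what MAX_VOICES is.

```cpp
#include <cstdint>

// Hypothetical helper: multiplies per second spent applying the mix gain.
constexpr std::uint64_t gainMultipliesPerSecond(std::uint64_t sampleRate,
                                                std::uint64_t multipliesPerFrame)
{
    return sampleRate * multipliesPerFrame;
}

constexpr std::uint64_t kSampleRate    = 192000; // the 192kHz case mentioned above
constexpr std::uint64_t kChannels      = 2;      // left and right
constexpr std::uint64_t kAssumedVoices = 16;     // ASSUMPTION: MAX_VOICES is not given

// vDSP version: the gain is applied once per channel per frame, after summing.
constexpr std::uint64_t afterSum =
    gainMultipliesPerSecond(kSampleRate, kChannels);

// Original version: the gain is applied per voice, per channel, per frame.
constexpr std::uint64_t insideLoop =
    gainMultipliesPerSecond(kSampleRate, kChannels * kAssumedVoices);
```

With these assumed values, afterSum is 384,000 and insideLoop is 6,144,000. Both are tiny next to the billions of floating-point operations per second a modern core can execute, which suggests the CPU time is going elsewhere, most likely into the per-voice doVoice calls.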

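Another important thing to note is that all vDSP functions work better when the vector data is aligned on 16-byte boundaries. If you look at the SSE2 instruction set (which I'm sure vDSP uses heavily), you'll see that many instructions have one version for aligned data and another for unaligned data.

The way to align data in gcc is something like this: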

float inVector[8] __attribute__ ((aligned(16))) = {1, 2, 3, 4, 5, 6, 7, 8};

Or, if you're allocating on the heap, see whether aligned_malloc is available.
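As a rough sketch of the alignment advice, the following checks whether a pointer sits on a 16-byte boundary and allocates aligned heap memory with posix_memalign, which is one POSIX option for the heap case. The helper names isAligned16, assertAligned16 and allocAlignedFloats are made up for this example. Note that the mBuffers pointers in the question are supplied by the audio host, so for those you can only check the alignment, not choose it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>   // posix_memalign, free

// Returns true when p sits on a 16-byte boundary.
inline bool isAligned16(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Debug-only check that could be placed in front of a vector call.
inline void assertAligned16(const void* p)
{
    assert(isAligned16(p));
}

// Hypothetical helper: allocate `count` floats on a 16-byte boundary.
// Returns nullptr on failure; release the result with free().
inline float* allocAlignedFloats(std::size_t count)
{
    void* p = nullptr;
    if (posix_memalign(&p, 16, count * sizeof(float)) != 0)
        return nullptr;
    return static_cast<float*>(p);
}

// Static or stack storage: alignas is the standard C++11 spelling of
// the gcc aligned attribute.
alignas(16) float inVector[8] = {1, 2, 3, 4, 5, 6, 7, 8};
```

A scratch buffer obtained from allocAlignedFloats(inNumberFrames) should pass isAligned16, whereas a pointer offset by a single float (for example inVector + 1) will not.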
