Question

Out of curiosity, a small micro-optimization question:

struct Timer {
    bool running{false};
    int ticks{0};

    void step_versionOne(int mStepSize) {
        if(running) ticks += mStepSize;
    }

    void step_versionTwo(int mStepSize) {
        ticks += mStepSize * static_cast<int>(running);
    }
};

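These two methods appear to do practically the same thing. Does the second version avoid a branch (and consequently run faster than the first), or will any compiler perform this kind of optimization on its own with -O3?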

Answer 1

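Yes, your trick avoids the branch, and it makes the code faster... sometimes.

I wrote a benchmark that compares these solutions in various situations, along with one of my own: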

ticks += mStepSize & -static_cast<int>(running)
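
To spell out why the and-negate form works: converting running to int gives 0 or 1, so negating it gives 0 or -1. On two's-complement targets (required since C++20, and universal in practice before that) -1 has every bit set, so the AND either keeps mStepSize unchanged or clears it to 0. A minimal sketch of that reasoning; the helper name masked_step is made up for illustration and is not part of the benchmark:

```cpp
// Hypothetical helper isolating the and-negate expression from this answer.
constexpr int masked_step(int stepSize, bool running) {
    // static_cast<int>(running) is 0 or 1, so its negation is 0 or -1.
    // -1 is all one bits in two's complement, so the AND keeps stepSize;
    // 0 clears every bit, so the AND yields 0.
    return stepSize & -static_cast<int>(running);
}

// Compile-time checks: the mask form agrees with the multiplication form.
static_assert(masked_step(7, true) == 7 * 1, "running: step passes through");
static_assert(masked_step(7, false) == 7 * 0, "stopped: step is masked to 0");
static_assert(masked_step(-3, true) == -3, "negative steps survive the mask");
```

Because the function is constexpr, the checks run at compile time; the snippet compiles as C++11 or later.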

My results are as follows:

Off:
 branch: 399949150
 mul:    399940271
 andneg: 277546678
On:
 branch: 204035423
 mul:    399937142
 andneg: 277581853
Pattern:
 branch: 327724860
 mul:    400010363
 andneg: 277551446
Random:
 branch: 915235440
 mul:    399916440
 andneg: 277537411

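Off is when the timers are turned off. In this case branch and mul take about the same time, and andneg is faster.

On is when they are all turned on. The branching solution is about twice as fast as mul.

Pattern is when the timers follow a 100110 pattern. Performance is similar, but branching is somewhat faster than mul.

Random is when the branch is unpredictable. In this case multiplication is more than 2 times faster than branching.

In all cases my bit-hacking trick is the fastest, except for On, where branching wins.

Note that this benchmark is not necessarily representative of all compiler versions, processors, and so on. Even small changes to the benchmark can turn the results upside down (for example, if the compiler can inline the call knowing that mStepSize is 1, then multiplication can actually be the fastest).

Benchmark code: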

#include <array>
#include <chrono>
#include <cstdlib>   // rand
#include <iostream>

struct Timer {
    bool running{false};
    int ticks{0};

    void branch(int mStepSize) {
        if(running) ticks += mStepSize;
    }

    void mul(int mStepSize) {
        ticks += mStepSize * static_cast<int>(running);
    }

    void andneg(int mStepSize) {
        ticks += mStepSize & -static_cast<int>(running);
    }
};

void run(std::array<Timer, 256>& timers, int step) {
    auto start = std::chrono::steady_clock::now();
    for(int i = 0; i < 1000000; i++)
        for(auto& t : timers)
            t.branch(step);
    auto end = std::chrono::steady_clock::now();
    std::cout << "branch: " << (end - start).count() << std::endl;
    start = std::chrono::steady_clock::now();
    for(int i = 0; i < 1000000; i++)
        for(auto& t : timers)
            t.mul(step);
    end = std::chrono::steady_clock::now();
    std::cout << "mul:    " << (end - start).count() << std::endl;
    start = std::chrono::steady_clock::now();
    for(int i = 0; i < 1000000; i++)
        for(auto& t : timers)
            t.andneg(step);
    end = std::chrono::steady_clock::now();
    std::cout << "andneg: " << (end - start).count() << std::endl;
}

int main() {
    std::array<Timer, 256> timers;
    int step = rand() % 256;

    run(timers, step); // warm up
    std::cout << "Off:\n";
    run(timers, step);
    for(auto& t : timers)
        t.running = true;
    std::cout << "On:\n";
    run(timers, step);
    std::array<bool, 6> pattern = {1, 0, 0, 1, 1, 0};
    for(int i = 0; i < 256; i++)
        timers[i].running = pattern[i % 6];
    std::cout << "Pattern:\n";
    run(timers, step);
    for(auto& t : timers)
        t.running = rand()&1;
    std::cout << "Random:\n";
    run(timers, step);
    for(auto& t : timers)
        std::cout << t.ticks << ' ';
    return 0;
}

Answer 2

Does the second version avoid a branch

If you compile your code to assembler output with g++ -o test.s test.cpp -S, you will find that the branch is indeed avoided in the second function.

and consequently, is faster than the first version

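I ran each of your functions 2147483647 (INT_MAX) times, and in each iteration I assigned a random boolean value to the running member of your Timer struct, using this code: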

int main() {
    const int max = std::numeric_limits<int>::max();
    timestamp_t start, end, one, two;
    Timer t_one, t_two;
    double percent;

    srand(time(NULL));

    start = get_timestamp();
    for(int i = 0; i < max; ++i) {
        t_one.running = rand() % 2;
        t_one.step_versionOne(1);
    }
    end = get_timestamp();
    one = end - start;

    std::cout << "step_versionOne      = " << one << std::endl;

    start = get_timestamp();
    for(int i = 0; i < max; ++i) {
        t_two.running = rand() % 2;
        t_two.step_versionTwo(1);
    }
    end = get_timestamp();
    two = end - start;

    percent = (one - two) / static_cast<double>(one) * 100.0;

    std::cout << "step_versionTwo      = " << two << std::endl;
    std::cout << "step_one - step_two  = " << one - two << std::endl;
    std::cout << "one fast than two by = " << percent << std::endl;
 }
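
The listing above relies on timestamp_t and get_timestamp(), which the answer never shows. What follows is only a guess at what they might look like, assuming a monotonic clock read in microseconds; the unit of the reported numbers is not stated, so treat both the type and the resolution as assumptions. A signed type is used so that a difference such as one - two does not wrap around when the second measurement is larger.

```cpp
#include <chrono>
#include <cstdint>

// Assumed definition: the answer uses this name but does not define it.
using timestamp_t = std::int64_t;

// Assumed helper: microseconds elapsed on a monotonic clock.
// The microsecond unit is a guess, not something the answer states.
inline timestamp_t get_timestamp() {
    using namespace std::chrono;
    const auto sinceEpoch = steady_clock::now().time_since_epoch();
    return static_cast<timestamp_t>(
        duration_cast<microseconds>(sinceEpoch).count());
}
```

To build the listing, it would also need the Timer struct from the question plus <iostream>, <limits>, <cstdlib> (srand, rand) and <ctime> (time).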

These are the results I got:

step_versionOne      = 39738380
step_versionTwo      = 26047337
step_one - step_two  = 13691043
one fast than two by = 34.4529%

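So yes, the second function is clearly faster, by about 35%. Note that for smaller iteration counts the improvement varied between 30% and 55%, whereas it seemed to settle at around 35% the longer the test ran. This might be due to system tasks executing sporadically while the simulation runs, an effect that evens out the longer you run it (although this is only my assumption; I have no idea whether it is actually true).

All in all, nice question, I learned something today!

MORE:

Of course, by generating running randomly, we essentially render branch prediction useless in the first function, so the results above are not too surprising. However, if we do not alter running during the loop and instead leave it at its default value, in this case false, branch prediction does its magic in the first function, which then is actually faster by almost 20%, as these results suggest: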

step_versionOne      = 6273942
step_versionTwo      = 7809508
step_two - step_one  = 1535566
two fast than one by = 19.6628

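Because running stays constant throughout execution, notice that the simulation time is much shorter than it was with a randomly changing running, likely the result of a compiler optimization.

Why is the second function slower in this case? Well, the branch predictor quickly learns that the condition in the first function is never met, so the correctly predicted branch costs almost nothing (almost as though if(running) ticks += mStepSize; were not even there). The second function, on the other hand, still has to execute ticks += mStepSize * static_cast<int>(running); in every iteration, which makes the first function more efficient.

But what if we set running to true? Branch prediction kicks in again, but this time the first function has to evaluate ticks += mStepSize; in every iteration. Here are the results with running{true}: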

step_versionOne      = 7522095
step_versionTwo      = 7891948
step_two - step_one  = 369853
two fast than one by = 4.68646

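Notice that step_versionTwo takes a consistent amount of time whether running is constantly true or constantly false. It still takes longer than step_versionOne, however marginally. This might be because I was too lazy to run it enough times to determine whether step_versionOne is consistently faster or whether this was a one-time fluke (results vary slightly between runs, since the OS is working in the background and is not always doing the same thing). But if step_versionOne is consistently faster, it might be because function two (ticks += mStepSize * static_cast<int>(running);) performs more arithmetic than function one (ticks += mStepSize;).

Finally, let's compile with optimization, g++ -o test test.cpp -std=c++11 -O1, revert running back to false, and check the results: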

step_versionOne      = 704973
step_versionTwo      = 695052

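More or less the same. The compiler does its optimization pass, realizes that running is always false, and for all intents and purposes removes the body of step_versionOne, so when you call it from the loop in main it just calls the function and returns.

Likewise, when optimizing the second function, it realizes that ticks += mStepSize * static_cast<int>(running); always produces the same result, i.e. adds 0, so it does not bother executing that either.

All in all, if I got this right (and if not, please correct me, I'm pretty new to this), all you get when calling both functions from the loop in main is their call overhead.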
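
If the goal is to time the step itself rather than the call overhead, the compiler must not be able to prove that running is constant. One common way to arrange that is to read the flag through a volatile object on every iteration. The sketch below is an illustration of that idea, not the code used for the measurements above; run_steps is a hypothetical name, and the optimizer remains free to turn the branch into a conditional move or the reverse.

```cpp
// Timer as defined in the question.
struct Timer {
    bool running{false};
    int ticks{0};

    void step_versionOne(int mStepSize) {
        if (running) ticks += mStepSize;
    }

    void step_versionTwo(int mStepSize) {
        ticks += mStepSize * static_cast<int>(running);
    }
};

// Hypothetical harness: runs one of the two versions `iterations` times and
// returns the final tick count so the work cannot be discarded as unused.
template <bool UseVersionTwo>
int run_steps(bool flag, int iterations) {
    // A volatile read must really happen each time, so the optimizer
    // cannot assume the flag keeps a known constant value.
    volatile bool opaqueFlag = flag;
    Timer t;
    for (int i = 0; i < iterations; ++i) {
        t.running = opaqueFlag;
        if (UseVersionTwo)          // compile-time constant, folded away
            t.step_versionTwo(1);
        else
            t.step_versionOne(1);
    }
    return t.ticks;
}
```

Wrapping calls such as run_steps<false>(false, n) and run_steps<true>(false, n) in the timing code would then compare the two versions with running held at false, while keeping the flag opaque to the optimizer.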

P.S. Here is the result for the first case (running generated randomly in each iteration) when compiled with optimization:

step_versionOne      = 18868782
step_versionTwo      = 18812315
step_two - step_one  = 56467
one fast than two by = 0.299261

Using bool in calculations to avoid branching
