衡量一切！

Question

我们正在开发一段代码，可以检查用户是否应该被允许在一段时间内进入某个扇区，我的一位同事创建了一个函数，在下面的代码中是 isAllowed 并包含几个比较，我采用了一种不同的方法，即函数 isAllowed2 ，它使用时间段之间的秒数。

起初我们毫不怀疑他的功能会更快，但在实际运行代码并比较速度时却不是这样，即使差异是我们可以完全忽略的，我们也想知道它为什么会这样。那个“应该”更快的那个实际上更慢。

考虑以下代码：

#include <iostream>
#include <chrono>

using namespace std;
using namespace std::chrono;

struct timing {
    short hour;
    short minute;
};

bool isAllowed(timing &from, timing &to, timing &actual) {
    return !(((from.hour > to.hour && (actual.hour >= from.hour || actual.hour <= to.hour)) ||
        (actual.hour >= from.hour && actual.hour <= to.hour)) &&
        !(actual.minute > from.minute && actual.minute < to.minute));
}

long getSecs(short hour, short minutes) {

    return (hour * 3600) + (minutes * 60);

}

bool isAllowed2(timing &from, timing &to, timing &current) {

    long secsFrom = getSecs(from.hour, from.minute);
    long secsTo = getSecs(to.hour, to.minute);
    long secsCurrent = getSecs(current.hour, current.minute);

    if (secsFrom > secsTo) secsTo += 24 * 60 * 60;
    if (secsCurrent > secsFrom && secsCurrent < secsTo) {
        return false;
    }

    return true;
}

int main() {
    //debug messages
    std::string okay = " - ok";
    std::string error = " - er";

    std::string receive = " - allowed";
    std::string notReceive = " - denied";

    //testing times
    int const testDataCount = 5;
    timing from[testDataCount] = {
        { 16, 30 },
        { 8,  30 },
        { 10, 30 },
        { 0, 30 },
        { 0, 0 }
    };
    timing to[testDataCount] = {
        { 8,  30 },
        { 20, 0 },
        { 20, 0 },
        { 6, 0 },
        { 7, 0 }
    };

    for (int i = 0; i < testDataCount; i++) {
        std::cout << i + 1 << ": " << from[i].hour << ":" << from[i].minute << " to " << to[i].hour << ":"
            << to[i].minute << std::endl;
    }

    //test current times
    timing current[5] = {
        { 12, 0 },
        { 23, 0 },
        { 17, 30 },
        { 15, 12 },
        { 0, 20 }
    };

    bool ergValues[][testDataCount] = {
        { true,  false, false, true, true },
        { false, true,  true, true, true },
        { false, false, false, true, true },
        { true,  false, false, true, true },
        { false,  true, true, true, false }
    };

    long totalNs1 = 0;
    long totalNs2 = 0;

    for (int i = 0; i < 4; i++) {
        std::cout << std::endl << i + 1 << ". Test: " << current[i].hour << ":" << current[i].minute << std::endl;
        for (int j = 0; j < testDataCount; j++) {

            high_resolution_clock::time_point t1 = high_resolution_clock::now();
            bool response = isAllowed(from[j], to[j], current[i]);
            high_resolution_clock::time_point t2 = high_resolution_clock::now();

            high_resolution_clock::time_point t3 = high_resolution_clock::now();
            bool response2 = isAllowed2(from[j], to[j], current[i]);
            high_resolution_clock::time_point t4 = high_resolution_clock::now();

            long ns1 = duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
            totalNs1 += ns1;
            long ns2 = duration_cast<std::chrono::nanoseconds>(t4 - t3).count();
            totalNs2 += ns2;

            std::cout << j + 1 << "\t\t:1:" << ns1 << "ns: " << response << (response == ergValues[i][j] ? okay : error) << "\t\t:2:" << ns2 << "ms: " << response2 << (response2 == ergValues[i][j] ? okay : error) << "\t\t"
                << (ergValues[i][j] ? receive : notReceive) << std::endl;
        }
    }

    std::cout << "\r\ntotalNs1 = " << totalNs1 << "\r\ntotalNs2 = " << totalNs2 << "\r\n\r\n";

    return 0;
}

结果显然总是不同，但无论totalNs2总是小于totalNs1。

例如：

totalNs1 = 38796
totalNs2 = 25913

我在Windows 10和Debian 8上的AMD Phenom II X4和Intel i7-3770上进行了测试，结果非常相似。

所以最后问题是，为什么函数 isAllowed2 比 isAllowed 更快？

注意：这主要是一个好奇心问题，如果有人认为标题或标签不是最合适的请告诉我，我会相应更改它们，请原谅任何最终的语法错误因为英语不是我的母语。

修改

与此同时，在了解了微观基准测试的不准确性之后，我一直在根据所有建议和评论进一步研究，包括this令人难以置信的详细解答（非常感谢 Baum mit Augen 用于链接this惊人的谈话，这有很多帮助）我最终使用Google microbenchmark library获得更多“准确”的结果，似乎 isAllowed 功能在事实上更快（没有优化编译），如图书馆的输出所示。

Run on (8 X 2395 MHz CPU s)
-----------------------------------------------------------------------
Benchmark                                Time           CPU Iterations
-----------------------------------------------------------------------
BM_isAllowed/2/min_time:2.000           22 ns         22 ns  128000000
BM_isAllowed/4/min_time:2.000           22 ns         22 ns  137846154
BM_isAllowed/8/min_time:2.000           22 ns         22 ns  128000000
BM_isAllowed/16/min_time:2.000          22 ns         22 ns  128000000
BM_isAllowed/22/min_time:2.000          22 ns         22 ns  137846154
BM_isAllowed2/2/min_time:2.000          24 ns         24 ns  112000000
BM_isAllowed2/4/min_time:2.000          24 ns         24 ns  119466667
BM_isAllowed2/8/min_time:2.000          24 ns         24 ns  119466667
BM_isAllowed2/16/min_time:2.000         24 ns         24 ns  119466667
BM_isAllowed2/22/min_time:2.000         24 ns         24 ns  119466667

注意：正如 Martin Bonner 所指出的， isAllowed 函数似乎有一个逻辑缺陷，不要使用它的生产代码。

下面你可以找到我用来做这个基准测试的代码，如果我有任何不足之处请告诉我，因为我不熟悉Google library。

重要的是，此代码是使用Visual Studio 2015编译的，并且对于我们要测试的部分，应禁用。

#include "benchmark/benchmark.h"

using namespace std;
using namespace benchmark;

#pragma optimize( "[optimization-list]", {on | off} )

volatile const long extraDay = 24 * 60 * 60;

#pragma optimize( "", off )

struct timing {
    short hour;
    short minute;
    timing(short hour, short minute) : hour(hour), minute(minute) {}
};

static void BM_isAllowed(benchmark::State& state) {

    while (state.KeepRunning())
    {
        timing from(state.range(0), state.range(0));
        timing to(state.range(0), state.range(0));
        timing current(state.range(0), state.range(0));

        bool b = !(((from.hour > to.hour && (current.hour >= from.hour || current.hour <= to.hour)) ||
            (current.hour >= from.hour && current.hour <= to.hour)) &&
            !(current.minute > from.minute && current.minute < to.minute));
    }
}

static void BM_isAllowed2(benchmark::State& state) {

    while (state.KeepRunning())
    {
        timing from(state.range(0), state.range(0));
        timing to(state.range(0), state.range(0));
        timing current(state.range(0), state.range(0));

        bool b;
        long secsFrom = secsFrom = (from.hour * 3600) + (from.minute * 60);
        long secsTo = (to.hour * 3600) + (to.minute * 60);
        long secsCurrent = (current.hour * 3600) + (current.minute * 60);

        if (secsFrom > secsTo)
            secsTo += extraDay;
        if (secsCurrent > secsFrom && secsCurrent < secsTo)
            b = false;
        else
            b = true;
    }
}
#pragma optimize( "", on ) 

BENCHMARK(BM_isAllowed)->RangeMultiplier(2)->Range(2, 22)->MinTime(2);
BENCHMARK(BM_isAllowed2)->RangeMultiplier(2)->Range(2, 22)->MinTime(2);
BENCHMARK_MAIN();

Answer 1

这没有黄金法则。不幸的是，像这样的代码的性能是众所周知难以预测的。最重要的是从中获取

衡量一切！

现在你的代码中发生了什么：正如其他人正确指出的那样，我们可以观察到isAllowed使用分支编译成函数，而isAllowed2最终变为无分支。

分支在谈论表现时很有意思：它们介于两者之间，实际上是免费的，而且非常昂贵，包含在内。这是由于一个称为分支预测器的CPU组件。它试图预测控制流将采用哪个分支，并使CPU推测性地执行它。如果它猜对了，分支是免费的。如果猜错了，分支就很贵了。可以在this answer中找到对该概念的详尽而详细的解释，包括一些数字。

所以现在我们需要决定是否需要分支或无分支版本。 一般来说，既不需要比另一个更快！这实际上取决于目标CPU预测分支的程度，当然这取决于实际输入。（因此，选择是否将函数编译为分支或无分支结果对于编译器来说是一个难题，因为他们不知道将运行二进制文件的CPU，也不知道期望什么样的输入数据。例如，参见this blogpost。）

所以如果您的基准测试实际上是正确的 ^†，我们已经确定在您的CPU上，分支太难以预测，无法击败相对便宜的整数算术。这也可能是由于测试用例数量很少，分支预测器无法从这么少的调用中学习模式。但同样，我们不能只是这样或那样，我们必须看看具体情况下的实际表现。

^†如评论中所述，对于良好的测量，执行时间有点短，我发现我的机器有很大的偏差。有关微基准测试的信息，您可以查看this talk，这比人们想象的更难。

另外，正如马丁·邦纳（Martin Bonner）帮助noticed，你的两个职能部门不做同样的事情，你当然必须修正它以获得正确的基准。

Answer 2

因为您没有衡量您想要衡量的内容。

事实上，在我的计算机上执行你的两个函数需要大约40ns，但是如果我使用你的测试代码，我会得到500ns的结果。

您没有执行所需的测量，因为： 1.仅执行一次这些功能的时间与实际获得时钟的功能的执行时间相同（甚至更小）。作为拇指的规则，要进行测试，请尝试测量大于10毫秒的时间。 2.在两个标记之间，编译器已经放置了一个积极内联和优化的函数版本，因为它知道输入是什么，这可能是在实际情况下不会发生的。

如果将两个函数的定义放在与定义输入的文件不同的文件中：

@Override
public int getCount() {
      if(!isAccess) {
        return 3;
      } else {
        return 4;
      }
    }

然后在“滴答声”之间执行一百万次你的功能，你将获得更可靠的结果：

//is_allowed.cpp
struct timing {
    short hour;
    short minute;
};
bool isAllowed(timing &from, timing &to, timing &actual) {
    return !(((from.hour > to.hour && (actual.hour >= from.hour || actual.hour <= to.hour)) ||
        (actual.hour >= from.hour && actual.hour <= to.hour)) &&
        !(actual.minute > from.minute && actual.minute < to.minute));
}

static long getSecs(short hour, short minutes) {

    return (hour * 3600) + (minutes * 60);

}

bool isAllowed2(timing &from, timing &to, timing &current) {

    long secsFrom = getSecs(from.hour, from.minute);
    long secsTo = getSecs(to.hour, to.minute);
    long secsCurrent = getSecs(current.hour, current.minute);

    if (secsFrom > secsTo) secsTo += 24 * 60 * 60;
    if (secsCurrent > secsFrom && secsCurrent < secsTo) {
        return false;
    }

    return true;
}

惊喜，我明白了：

int main(){
//...

            high_resolution_clock::time_point t1 = high_resolution_clock::now();
            for (int x=1;x<1000000;++x)
               isAllowed(from[j], to[j], current[i]);
            high_resolution_clock::time_point t2 = high_resolution_clock::now();

            high_resolution_clock::time_point t3 = high_resolution_clock::now();
            for (int x=1;x<1000000;++x)
               isAllowed2(from[j], to[j], current[i]);
            high_resolution_clock::time_point t4 = high_resolution_clock::now();

            long ns1 = duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
            totalNs1 += ns1;
            long ns2 = duration_cast<std::chrono::nanoseconds>(t4 - t3).count();
            totalNs2 += ns2;

//            std::cout << j + 1 << "\t\t:1:" << ns1 << "ns: " << response << (response == ergValues[i][j] ? okay : error) << "\t\t:2:" << ns2 << "ms: " << response2 << (response2 == ergValues[i][j] ? okay : error) << "\t\t"
//                << (ergValues[i][j] ? receive : notReceive) << std::endl;
//...
    }

使用完全相同的计算机和编译器代码时，我得到了：

    totalNs1=38085793  //(38ms)
    totalNs2=52182920  //(52ms)

正如您所料，totalNs1 = 927 totalNs2 = 587的第一个版本实际上是赢家！

几个比较如何比某些计算慢？

修改

2 个答案:

衡量一切！