Question

A while ago, I asked a question on Stack Overflow and was shown how to execute the rdtsc opcode in C++. I recently used rdtsc to create a benchmark function, as follows:

inline unsigned long long rdtsc() {
  unsigned int lo, hi;
  asm volatile (
     "cpuid \n"
     "rdtsc" 
   : "=a"(lo), "=d"(hi) /* outputs */
   : "a"(0)             /* inputs */
   : "%ebx", "%ecx");     /* clobbers*/
  return ((unsigned long long)lo) | (((unsigned long long)hi) << 32);
}

typedef uint64_t (*FuncOneInt)(uint32_t n);
/**
     time a function that takes an integer parameter and returns a 64 bit number
     Since this is capable of timing in clock cycles, we won't have to do it a
     huge number of times and divide, we can literally count clocks.
     Don't forget that everything takes time including getting into and out of the
     function.  You may want to time an empty function.  The time to do the computation
     can be computed by taking the time of the function you want minus the empty one.
 */
void clockBench(const char* msg, uint32_t n, FuncOneInt f) {
    uint64_t t0 = rdtsc();
    uint64_t r = f(n);
    uint64_t t1 = rdtsc();
    std::cout << msg << "n=" << n << "\telapsed=" << (t1-t0) << '\n';
}

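So I assumed that if I benchmark a function, I get (approximately) the number of clock cycles it takes to execute. I also assumed that if I want to subtract the clock cycles needed to get into and out of the function, I should benchmark an empty function first, and then one with the code I care about inside.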

Here is a sample:

uint64_t empty(uint32_t n) {
    return 0;
}

uint64_t sum1Ton(uint32_t n) {
    uint64_t s = 0;
    for (int i = 1; i <= n; i++)
        s += i;
    return s;
}

The code is compiled with:

g++ -g -O2

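I could understand some error due to an interrupt or some other condition, but given that these routines are short and n is chosen to be small, I assumed I would see the real numbers. To my surprise, this is the output from two successive runs: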

empty n=100 elapsed=438
Sum 1 to n=100  elapsed=887

empty n=100 elapsed=357
Sum 1 to n=100  elapsed=347

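Consistently, the empty function shows that it takes way more than it should.

After all, getting into and out of the function involves only a few instructions. The real work is done in the loop. Never mind the fact that the variance is huge. In the second run, the empty function claims to take 357 clock cycles while the sum takes fewer, which is ridiculous.

What is happening?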

Answer 1

"Consistently, the empty function shows that it takes way more than it should."

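You have cpuid inside the timed interval. According to Agner Fog's testing, cpuid on Intel Sandybridge-family CPUs takes 100 to 250 core clock cycles (depending on the inputs, which you neglected to set). (https://agner.org/optimize/).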

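But you are not measuring core clock cycles; you are measuring RDTSC reference cycles, which can be significantly shorter. (For example, my Skylake i7-6700k idles at 800 MHz, but its reference clock frequency is 4008 MHz.) See "Get CPU cycle count?" for my attempt at a canonical answer on rdtsc.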

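Warm up the CPU first, or run a pause busy loop on another core to keep it pegged at its maximum frequency (assuming it is a desktop or laptop dual or quad core, where all core frequencies are locked together).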
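A warm-up can be as simple as spinning for a while before the first timed region. The sketch below is one possible way to do it; the name `warm_up` is hypothetical, and how long a given CPU needs to ramp up is an assumption you would have to check on your own machine.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper: busy-spin for roughly `duration` so the CPU has a
// chance to ramp up to its full clock speed before any timing starts.
// Returns the number of loop iterations, mostly so the work is observable.
inline std::uint64_t warm_up(std::chrono::milliseconds duration) {
    const auto deadline = std::chrono::steady_clock::now() + duration;
    volatile std::uint64_t sink = 0;  // volatile keeps the loop from being optimized away
    while (std::chrono::steady_clock::now() < deadline) {
        sink = sink + 1;
    }
    return sink;
}
```

Calling something like `warm_up(std::chrono::milliseconds(100))` at the top of `main`, before the first `clockBench` call, is one way to use it; the 100 ms figure is a guess, not a measured value.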

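"Never mind the fact that the variance is huge. In the second run, the empty function claims to take 357 clock cycles while the sum takes fewer, which is ridiculous."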

Is that effect consistent, too?

Maybe your CPU ramped up to full speed during or after printing the third line of messages, making the last timed region run a lot faster? (See "Why does this delay-loop start to run faster after several iterations with no sleep?")
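One way to check whether the effect is consistent is to time the same function several times back to back and look at every sample rather than a single number. The following is a minimal sketch under that assumption; `time_repeated` is a made-up name, and `now` can be any tick source, such as the `rdtsc()` function from the question.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: time `work()` `runs` times in a row with the tick
// source `now()` and return every sample, so that a frequency ramp-up shows
// up as a trend across the samples instead of hiding in one measurement.
template <typename Now, typename Work>
std::vector<std::uint64_t> time_repeated(Now now, Work work, std::size_t runs) {
    std::vector<std::uint64_t> samples;
    samples.reserve(runs);
    for (std::size_t i = 0; i < runs; ++i) {
        const std::uint64_t t0 = now();
        work();
        const std::uint64_t t1 = now();
        samples.push_back(t1 - t0);
    }
    return samples;
}
```

If the first few samples are much larger than the later ones, that points at warm-up effects (clock speed, caches, branch predictors) rather than at the function itself.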

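I don't know how much effect different garbage in eax and ecx before cpuid could have. Replace it with lfence to eliminate that problem and to get a much lower-overhead way of serializing rdtsc.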
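As a rough sketch of that suggestion, the timestamp read could look like the following. It assumes GCC or Clang targeting x86-64, where `<x86intrin.h>` provides the `__rdtsc` and `_mm_lfence` intrinsics; the function name `rdtsc_lfence` is made up here, and placing a fence on both sides of the read is one common arrangement rather than the only valid one.

```cpp
#include <cstdint>
#include <x86intrin.h>  // __rdtsc and _mm_lfence (GCC/Clang, x86 only)

// Hypothetical replacement for the cpuid-based rdtsc() in the question:
// lfence is used instead of cpuid to order the timestamp read.
inline std::uint64_t rdtsc_lfence() {
    _mm_lfence();                       // let earlier instructions finish first
    const std::uint64_t t = __rdtsc();  // read the 64-bit time-stamp counter
    _mm_lfence();                       // keep later instructions from starting before the read
    return t;
}
```

Swapping this in for `rdtsc()` inside `clockBench` removes the variable cost of cpuid from the timed interval, though the result is still in reference cycles, not core clock cycles.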
