I am trying to measure the performance of a function:
double microbenchmark_get_sqrt_latency()
{
    myInt64 start, end;
    list<double> cyclesList;
    int num_runs = 40;
    double cycles = 0.;
    double multiplier = 1.;
    double x = 500;
    // Repeat the measurement 1000 times
    for (size_t i = 0; i < 1000; i++)
    {
        // Measuring...
        start = start_tsc();
        for (size_t j = 0; j < num_runs; ++j)
        {
            sqrtsd(x);
        }
        // Maybe this instruction is called before the loop ends? somehow?
        end = stop_tsc(start);
        // Doesn't return the correct number of cycles because
        cycles = ((double)end) / num_runs;
        cyclesList.push_back(cycles);
    }
    cyclesList.sort();
    auto it = cyclesList.begin();
    std::advance(it, cyclesList.size() / 2);
    return *it;
}
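The problem is that the variable `end`, which holds the number of cycles elapsed since the first `rdtsc` instruction, is always 22-24, even when `num_runs` is raised as high as 10000. The only explanation I can think of is that the instruction is somehow moved to just after the first iteration of the for loop.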
I am compiling with the following flags: -O3 -fno-tree-vectorize -march=skylake -std=c++17
Here are the implementations of `start_tsc()` and `stop_tsc()`:
#define RDTSC(cpu_c)     \
    ASM VOLATILE("rdtsc" \
                 : "=a"((cpu_c).int32.lo), "=d"((cpu_c).int32.hi))
#define CPUID()          \
    ASM VOLATILE("cpuid" \
                 :       \
                 : "a"(0) \
                 : "bx", "cx", "dx")
unsigned long long start_tsc(void)
{
    tsc_counter start;
    CPUID();
    RDTSC(start);
    return COUNTER_VAL(start);
}
unsigned long long stop_tsc(unsigned long long start)
{
    tsc_counter end;
    RDTSC(end);
    CPUID();
    return COUNTER_VAL(end) - start;
}
What is wrong with the code? I expected the `end` variable to be proportional to `num_runs`, but it is not. Any ideas?
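
One thing that may be worth checking, assuming `sqrtsd(x)` is a side-effect-free wrapper whose result is discarded (its definition is not shown here): at `-O3` the compiler is allowed to delete such a loop entirely, which would leave only the `rdtsc`/`cpuid` overhead between the two timestamps regardless of `num_runs`. The sketch below shows one way to keep a timed loop alive by chaining each result into the next call and storing the final value in a `volatile` object. `time_sqrt_chain` and `ChainResult` are hypothetical names, and the sketch uses `std::sqrt` and `std::chrono::steady_clock` in place of `sqrtsd` and the TSC helpers so that it stays self-contained.

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>

// Hypothetical result type (not from the code above).
struct ChainResult {
    double value;        // final value of the chain, kept so the work is observable
    double ns_per_iter;  // average wall-clock time per iteration, in nanoseconds
};

// Hypothetical helper: times a chain of dependent square roots.
ChainResult time_sqrt_chain(double x, std::size_t num_runs)
{
    // Storing the input in a volatile object hides its value from the
    // optimizer, so the chain cannot be folded at compile time.
    volatile double input = x;

    auto t0 = std::chrono::steady_clock::now();
    // The volatile read happens after the first timestamp, so the loop
    // that depends on it cannot be hoisted above that timestamp.
    double acc = input;
    for (std::size_t j = 0; j < num_runs; ++j) {
        // Each sqrt consumes the previous result: a true dependency chain,
        // which is what a latency measurement needs.
        acc = std::sqrt(acc);
    }
    // The volatile write makes the loop's result observable, so the
    // compiler may not remove the loop as dead code.
    volatile double sink = acc;
    auto t1 = std::chrono::steady_clock::now();

    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    return {sink, ns / static_cast<double>(num_runs)};
}
```

If the total elapsed time now grows with `num_runs`, that would suggest the original loop was being optimized away. Note that repeated square roots of 500 converge to 1.0 after a few dozen iterations, so the operand is not constant across the chain.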