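I was curious how std::inner_product() compares with a hand-written dot product, so I ran a test.
std::inner_product() is 4x faster than the manual implementation. I find that strange, because surely there aren't that many ways to compute it?! I also can't see any SSE/AVX registers being used in the computation.
Setup: VS2013 / MSVC (12?), Haswell i7 4770 CPU, 64-bit build, Release mode.
Here is the C++ test code: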
#include <iostream>
#include <functional>
#include <numeric>
#include <cstdint>
#include <intrin.h> // MSVC header that declares __rdtsc and __rdtscp
int main() {
const int arraySize = 1000;
const int numTests = 500;
unsigned int f = 0, s = 0; // receive the auxiliary value that __rdtscp writes
unsigned long long* array1 = new unsigned long long[arraySize];
unsigned long long* array2 = new unsigned long long[arraySize];
//Initialise arrays
for (int i = 0; i < arraySize; i++){
unsigned long long val = __rdtsc();
array1[i] = val;
array2[i] = val;
}
//std::inner_product test
unsigned long long timingBegin1 = __rdtscp(&s);
for (int i = 0; i < numTests; i++){
volatile unsigned long long result = std::inner_product(array1, array1 + arraySize, array2, static_cast<uint64_t>(0));
}
unsigned long long timingEnd1 = __rdtscp(&s);
f = 0; s = 0;
//Manual Dot Product test
unsigned long long timingBegin2 = __rdtscp(&f);
for (int i = 0; i < numTests; i++){
volatile unsigned long long result = 0;
for (int j = 0; j < arraySize; j++){ // j avoids shadowing the outer loop counter
result += (array1[j] * array2[j]);
}
}
unsigned long long timingEnd2 = __rdtscp(&f);
std::cout << "STL: " << static_cast<double>(timingEnd1 - timingBegin1) / numTests << " CPU cycles per dot product" << std::endl;
std::cout << "Manually: " << static_cast<double>(timingEnd2 - timingBegin2) / numTests << " CPU cycles per dot product" << std::endl;
delete[] array1;
delete[] array2;
return 0;
}
Answer 0 (score: 3)
Your test is flawed, and that could easily account for a large difference.
volatile uint64_t result = 0;
for (int i = 0; i < arraySize; i++){
result += (array1[i] * array2[i]);
}
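Notice that you keep using the volatile-qualified variable inside the loop. That forces the compiler to write every intermediate result to memory (and read it back), so the accumulator cannot stay in a register.
In contrast, your inner_product version: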
volatile uint64_t result = std::inner_product(array1, array1 + arraySize, array2, static_cast<uint64_t>(0));
first computes the inner product, with optimizations permitted, and only then assigns the result to the volatile-qualified variable.
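To make the difference concrete, here is a minimal sketch of the two accumulation patterns side by side. The function names `dot_volatile_accumulator` and `dot_local_accumulator` are hypothetical, introduced only for illustration. The first mirrors the manual loop from the question. The second keeps the running sum in an ordinary local and touches a volatile variable only once at the end, which is roughly the situation the inner_product version was already in.

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Mirrors the question's manual loop: the accumulator itself is volatile,
// so every iteration performs a volatile read and a volatile write.
std::uint64_t dot_volatile_accumulator(const std::uint64_t* a,
                                       const std::uint64_t* b,
                                       std::size_t n) {
    volatile std::uint64_t result = 0;
    for (std::size_t i = 0; i < n; ++i) {
        result = result + a[i] * b[i];
    }
    return result;
}

// Accumulates in a plain local, which the compiler is free to keep in a
// register, then stores to a volatile sink exactly once.
std::uint64_t dot_local_accumulator(const std::uint64_t* a,
                                    const std::uint64_t* b,
                                    std::size_t n) {
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    volatile std::uint64_t sink = sum; // single volatile store
    return sink;
}
```

Both functions return the same value as `std::inner_product(a, a + n, b, std::uint64_t{0})`. Only the number of forced memory accesses differs. Timing `dot_local_accumulator` against `std::inner_product` should therefore be a fairer comparison, though the actual numbers will depend on the compiler and its flags.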