Question

The following code shows a large performance difference between the two versions of max_3 on my machine (Windows 7, VC++ 2015, release build).

#include <algorithm>
#include <chrono>
#include <iostream>
#include <random>
#include <vector>

template <typename X>
const X& max_3_left( const X& a, const X& b, const X& c )
{
    return std::max( std::max( a, b ), c );
}

template <typename X>
const X& max_3_right( const X& a, const X& b, const X& c )
{
    return std::max( a, std::max( b, c ) );
}

int main()
{
    std::random_device r;
    std::default_random_engine e1( r() );
    std::uniform_int_distribution<int> uniform_dist( 1, 6 );
    std::vector<int> numbers;
    for ( int i = 0; i < 1000; ++i )
        numbers.push_back( uniform_dist( e1 ) );

    auto start1 = std::chrono::high_resolution_clock::now();
    int sum1 = 0;
    for ( int i = 0; i < 1000; ++i )
        for ( int j = 0; j < 1000; ++j )
            for ( int k = 0; k < 1000; ++k )
                sum1 += max_3_left( numbers[i], numbers[j], numbers[k] );
    auto finish1 = std::chrono::high_resolution_clock::now();
    std::cout << "left  " << sum1 << " " <<
        std::chrono::duration_cast<std::chrono::microseconds>(finish1 - start1).count()
        << " us" << std::endl;

    auto start2 = std::chrono::high_resolution_clock::now();
    int sum2 = 0;
    for ( int i = 0; i < 1000; ++i )
        for ( int j = 0; j < 1000; ++j )
            for ( int k = 0; k < 1000; ++k )
                sum2 += max_3_right( numbers[i], numbers[j], numbers[k] );
    auto finish2 = std::chrono::high_resolution_clock::now();
    std::cout << "right " << sum2 << " " <<
        std::chrono::duration_cast<std::chrono::microseconds>(finish2 - start2).count()
        << " us" << std::endl;
}

Output:

left  739861041 796056 us
right 739861041 1442495 us

On ideone, the difference is smaller but still not negligible.

Why does this difference exist?

Answer 1

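gcc and clang (and probably MSVC) fail to realize that max is an associative operation, like addition. v[i] max (v[j] max v[k]) (max_3_right) is the same as (v[i] max v[j]) max v[k] (max_3_left). I'm writing max as an infix operator to point out the similarity to + and other associative operations.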

Since v[k] is the only input that changes inside the inner loop, hoisting (v[i] max v[j]) out of the inner loop is obviously a big win.
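The hoisting can also be done by hand. The following is a minimal sketch, not code from this answer: the names sum_right_naive and sum_right_hoisted are hypothetical, and it relies on max over int being associative, so max(a, max(b, c)) == max(max(a, b), c).

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Right-associated form, as in max_3_right. max(v[j], v[k]) depends on k,
// so nothing involving v[i] leaves the inner loop unless the compiler
// reassociates the expression.
// long long is used so the sum does not overflow for larger inputs.
long long sum_right_naive(const std::vector<int>& v) {
    long long sum = 0;
    for (std::size_t i = 0; i < v.size(); ++i)
        for (std::size_t j = 0; j < v.size(); ++j)
            for (std::size_t k = 0; k < v.size(); ++k)
                sum += std::max(v[i], std::max(v[j], v[k]));
    return sum;
}

// Same sum after reassociating by hand: max(v[i], v[j]) is computed once
// per (i, j) pair, leaving a single max in the inner loop.
long long sum_right_hoisted(const std::vector<int>& v) {
    long long sum = 0;
    for (std::size_t i = 0; i < v.size(); ++i)
        for (std::size_t j = 0; j < v.size(); ++j) {
            const int ij = std::max(v[i], v[j]);  // loop-invariant in k
            for (std::size_t k = 0; k < v.size(); ++k)
                sum += std::max(ij, v[k]);
        }
    return sum;
}
```

Both functions return the same value for any input; only the amount of work per inner-loop iteration differs.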

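To see what is actually happening, we should, as always, look at the asm. To make the asm for the loops easy to find, I split them out into separate functions. (Making one template function that takes the max3 function as a parameter would be more C++-like.) This has the added advantage of taking the code we want optimized out of main, which gcc marks as "cold", disabling some optimizations.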

#include <algorithm>
#include <vector>
#define SIZE 1000
// max_3_right is the template from the question.
int sum_maxright(const std::vector<int> &v) {
    int sum = 0;
    for ( int i = 0; i < SIZE; ++i )
        for ( int j = 0; j < SIZE; ++j )
            for ( int k = 0; k < SIZE; ++k )
                sum += max_3_right( v[i], v[j], v[k] );
    return sum;
}
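The parenthetical about a single template function taking the max3 function as a parameter could look roughly like this. It is a sketch under assumptions, not the code that was compiled for this answer: sum_max3, MaxLeft, and MaxRight are hypothetical names, and the sum is widened to long long to avoid overflow on larger inputs.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: sums max3(v[i], v[j], v[k]) over all index triples.
// Max3 is any callable taking three ints and returning an int.
template <typename Max3>
long long sum_max3(const std::vector<int>& v, Max3 max3) {
    long long sum = 0;
    for (std::size_t i = 0; i < v.size(); ++i)
        for (std::size_t j = 0; j < v.size(); ++j)
            for (std::size_t k = 0; k < v.size(); ++k)
                sum += max3(v[i], v[j], v[k]);
    return sum;
}

// Left-associated max of three, matching max_3_left from the question.
struct MaxLeft {
    int operator()(const int& a, const int& b, const int& c) const {
        return std::max(std::max(a, b), c);
    }
};

// Right-associated max of three, matching max_3_right from the question.
struct MaxRight {
    int operator()(const int& a, const int& b, const int& c) const {
        return std::max(a, std::max(b, c));
    }
};
```

A call such as sum_max3(v, MaxRight{}) then instantiates one loop nest per functor type, so each version still gets its own function in the asm output.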

The innermost loop of sum_maxright compiles to the following (gcc 5.3 targeting the x86-64 Linux ABI, with -std=gnu++11 -fverbose-asm -O3 -fno-tree-vectorize -fno-unroll-loops -march=haswell, plus some hand annotations):

## from outer loops: rdx points to v[k] (starting at v.begin()).  r8 is v.end().  (r10 is v.begin)
## edi is v[i], esi is v[j]
## eax is sum

 ## inner loop.  See the full asm on godbolt.org, link below
.L10:
        cmp     DWORD PTR [rdx], esi      # MEM[base: _65, offset: 0], D.92793
        mov     ecx, esi                  # D.92793, D.92793
        cmovge  ecx, DWORD PTR [rdx]      # ecx = max(v[j], v[k])
        cmp     ecx, edi      # D.92793, D.92793
        cmovl   ecx, edi      # ecx = max(ecx, v[i])
        add     rdx, 4    # pointer increment
        add     eax, ecx  # sum, D.92793
        cmp     rdx, r8   # ivtmp.253, D.92795
        jne     .L10      #,

Clang 3.8 generates similar code for the max_3_right loop, with two cmov instructions inside the inner loop. (Use the compiler dropdown in the Godbolt Compiler Explorer to see it.)

gcc and clang both optimize the max_3_left loop the way you would expect, hoisting everything except a single cmov out of the inner loop.

## register allocation is slightly different here:
## esi = max(v[i], v[j]).    rdi = v.end()
.L2:
        cmp     DWORD PTR [rdx], ecx      # MEM[base: _65, offset: 0], D.92761
        mov     esi, ecx  # D.92761, D.92761
        cmovge  esi, DWORD PTR [rdx]        # MEM[base: _65, offset: 0],, D.92761
        add     rdx, 4    # ivtmp.226,
        add     eax, esi  # sum, D.92761
        cmp     rdx, rdi  # ivtmp.226, D.92762
        jne     .L2       #,

So there is much less going on in this loop. (On Intel CPUs before Broadwell, cmov is a 2-uop instruction, so one fewer cmov is a big deal.)

BTW, cache prefetching effects cannot explain this:

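The inner loop accesses numbers[k] sequentially. The repeated accesses to numbers[i] and numbers[j] are hoisted out of the inner loop by any decent compiler, and would not confuse modern prefetchers even if they were not.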

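Intel's optimization manual says that, for Sandybridge-family microarchitectures, up to 32 streams of prefetch patterns can be detected and maintained, with a limit of one forward and one backward stream per 4k page (section 2.3.5.4, Data Prefetching).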

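The OP said nothing at all about what hardware he ran this microbenchmark on, but since real compilers hoist the other loads and leave only the most trivial access pattern, it hardly matters.
A vector of 1000 ints (4B each) takes only about 4kiB. That means the whole array easily fits in L1D cache, so no prefetching of any kind is needed in the first place. It stays hot in L1 cache pretty much the whole time.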
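The size argument is simple arithmetic and can be checked with a few lines. This is only an illustrative sketch: payload_bytes and fits_in_cache are hypothetical helpers, and the 32 KiB default is an assumed L1D size (typical for Intel cores of that era), not a figure stated in this answer.

```cpp
#include <cstddef>
#include <vector>

// Bytes occupied by the elements themselves, ignoring the vector object
// and any allocator overhead.
std::size_t payload_bytes(const std::vector<int>& v) {
    return v.size() * sizeof(int);
}

// True if the element data fits in a cache of the given size.
// The 32 KiB default is an assumption, not a measured value.
bool fits_in_cache(const std::vector<int>& v,
                   std::size_t cache_bytes = 32 * 1024) {
    return payload_bytes(v) <= cache_bytes;
}
```

With a 4-byte int, 1000 elements come to 4000 bytes, well under the assumed 32 KiB.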

Answer 2

As molbdnilo pointed out, the problem could be the order of the loops. When calculating sum1, the code can be rewritten as:

for ( int i = 0; i < 1000; ++i )
   for ( int j = 0; j < 1000; ++j ) {
      auto temp = std::max(numbers[i], numbers[j]);
      for ( int k = 0; k < 1000; ++k )
            sum1 += std::max(temp, numbers[k]);
   }

The same cannot be applied to the calculation of sum2. However, when I reordered the second loop nest as:

for ( int j = 0; j < 1000; ++j )
   for ( int k = 0; k < 1000; ++k )
      for ( int i = 0; i < 1000; ++i )
         sum2 += ...;

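I got the same times for both calculations. (Moreover, both calculations are a lot faster with -O3 than with -O2. Judging by the disassembly output, the former seems to turn on vectorization.)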
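One way to see why the reordering helps: with i as the innermost index, max(numbers[j], numbers[k]) no longer changes inside the inner loop. The sketch below makes that explicit. It assumes the loop body is still the right-associated max of the three elements, and the function names are hypothetical, as is the use of long long for the sum.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Original i, j, k order with the right-associated max, for comparison.
long long sum2_original_order(const std::vector<int>& numbers) {
    long long sum = 0;
    for (std::size_t i = 0; i < numbers.size(); ++i)
        for (std::size_t j = 0; j < numbers.size(); ++j)
            for (std::size_t k = 0; k < numbers.size(); ++k)
                sum += std::max(numbers[i], std::max(numbers[j], numbers[k]));
    return sum;
}

// Reordered j, k, i nest: the inner max(numbers[j], numbers[k]) is
// invariant in i, so it can be computed once per (j, k) pair.
long long sum2_reordered(const std::vector<int>& numbers) {
    long long sum = 0;
    for (std::size_t j = 0; j < numbers.size(); ++j)
        for (std::size_t k = 0; k < numbers.size(); ++k) {
            const int jk = std::max(numbers[j], numbers[k]);  // invariant in i
            for (std::size_t i = 0; i < numbers.size(); ++i)
                sum += std::max(numbers[i], jk);
        }
    return sum;
}
```

Both orders visit the same set of index triples, so the sums are identical; only the expression that stays constant across the inner loop changes.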

Answer 3

This is related to data cache prefetching at the hardware level.

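If you use the left-associative version, the elements of the array are used/loaded in the order the CPU cache expects, so latency is lower.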

The right-associative version breaks the prediction and generates more cache misses, hence the slower performance.

Performance of max of 3 values, left-associative version vs. right-associative version
