Question

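I came across this SO question about using the `final` keyword to reduce virtual method overhead (Virtual function efficiency and the 'final' keyword). Based on this answer, the expectation is that a derived-class pointer calling overridden methods marked `final` will not pay the overhead of dynamic dispatch.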

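To benchmark the benefit of this approach, I set up a few sample classes and ran them on Quick-Bench (here is the link). There are 3 cases:
Case 1: Derived-class pointer without the final specifier: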

Derived* f = new DerivedWithoutFinalSpecifier();
f->run_multiple(100); // calls an overridden method 100 times

Case 2: Base-class pointer with the final specifier:

Base* f = new DerivedWithFinalSpecifier();
f->run_multiple(100); // calls an overridden method 100 times

Case 3: Derived-class pointer with the final specifier:

Derived* f = new DerivedWithFinalSpecifier();
f->run_multiple(100); // calls an overridden method 100 times

The function run_multiple looks like this:

int run_multiple(int times) /* specifiers */ {
    int sum = 0;
    for(int i = 0; i < times; i++) {
        sum += run_once();
    }
    return sum;
}

The results I observed were:
By speed: Case 2 == Case 3 > Case 1

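But shouldn't Case 3 be much faster than Case 2? Is there something wrong with my experiment design or with my assumptions about the expected result?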

Edit: Peter Cordes pointed out some very useful articles for further reading on this topic:
Is final used for optimization in C++?
Why can't gcc devirtualize this function call?
LTO, Devirtualization, and Virtual Tables

Answer 1

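You're correctly understanding the effects of final (except maybe for the inner loop of Case 2), but your cost estimates are way off. We shouldn't expect a big effect anywhere, because mt19937 is just plain slow and all 3 versions spend most of their time in it.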

The only thing that isn't lost or buried in noise and overhead is the effect of inlining int run_once() override final into the inner loop in FooPlus::run_multiple, which both Case 2 and Case 3 run.

Case 1, however, can't inline Foo::run_once() into Foo::run_multiple(), so unlike the other two cases there is function-call overhead inside the inner loop.

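Case 2 has to make a real (non-inlined) call to run_multiple on every outer iteration, but that happens only once per 100 runs of run_once and has no measurable effect.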
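As a rough illustration of the three paragraphs above, here is a minimal sketch of what the class layout might look like. The benchmark source is not reproduced in this post, so the member names `rng` and `dist`, the seed, the distribution range, and the exact placement of `final` are assumptions; only `Foo`, `FooPlus`, `run_once`, and `run_multiple` come from the discussion.

```cpp
#include <random>

// Hypothetical reconstruction of the benchmarked classes (details assumed).
struct Foo {
    std::mt19937 rng{42u};                          // fixed seed: an assumption
    std::uniform_int_distribution<int> dist{0, 9};  // range: an assumption

    virtual ~Foo() = default;
    virtual int run_once() { return dist(rng); }

    virtual int run_multiple(int times) {
        int sum = 0;
        for (int i = 0; i < times; i++) {
            // run_once is virtual and not final in Foo, so a derived class may
            // override it: the compiler generally keeps a call inside the loop.
            sum += run_once();
        }
        return sum;
    }
};

struct FooPlus : Foo {
    int run_once() override final { return dist(rng); }

    int run_multiple(int times) override final {
        int sum = 0;
        for (int i = 0; i < times; i++) {
            // Here `this` is a FooPlus and run_once is final, so the target is
            // statically known and the call can be inlined into the loop.
            sum += run_once();
        }
        return sum;
    }
};
```

Both versions compute the same sum for the same seed; the difference is only in whether the compiler can remove the call from the inner loop.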

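For all 3 cases, most of the time is spent in dist(rng);, because std::mt19937 is pretty slow compared to the extra overhead of not inlining a function call. Out-of-order execution can probably hide a lot of that overhead, too. But not all of it, so there is still something to measure.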
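One way to check this claim on your own machine is to time the PRNG draw by itself. The sketch below is only a crude wall-clock measurement, not a substitute for Quick-Bench; the helper names `ns_per_call`, `prng_ns_per_draw`, and `trivial_ns_per_call` are made up for this example, and the seed and distribution range are assumptions.

```cpp
#include <chrono>
#include <random>

// Hypothetical helper: average wall-clock nanoseconds per call of `f`.
template <class F>
double ns_per_call(F f, int iters) {
    using clock = std::chrono::steady_clock;
    volatile int sink = 0;  // store each result so the loop is not optimized away
    const auto t0 = clock::now();
    for (int i = 0; i < iters; i++) {
        sink = f();
    }
    const auto t1 = clock::now();
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / iters;
}

// Approximate cost of one draw, i.e. the work done inside run_once().
double prng_ns_per_draw(int iters) {
    std::mt19937 rng{42u};                          // seed: an assumption
    std::uniform_int_distribution<int> dist{0, 9};  // range: an assumption
    return ns_per_call([&] { return dist(rng); }, iters);
}

// Rough baseline: a trivial operation in place of the PRNG draw.
double trivial_ns_per_call(int iters) {
    int counter = 0;
    return ns_per_call([&] { return ++counter; }, iters);
}
```

Comparing `prng_ns_per_draw(1000000)` with `trivial_ns_per_call(1000000)` gives a feel for how much of each run_once() call is PRNG work. The numbers depend heavily on compiler flags and hardware, so treat them as indicative only.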

Case 3 is able to inline everything into this asm loop (from your Quick-Bench link):

 # percentages are *self* time, not including time spent in the PRNG
 # These are from QuickBench's perf report tab,
 #  presumably sample for core clock cycle perf events.
 # Take them with a grain of salt: superscalar + out-of-order exec
 #  makes it hard to blame one instruction for a clock cycle

   VirtualWithFinalCase2(benchmark::State&):   # case 3 from QuickBench link
     ... setup before the loop
     .p2align 3
    .Louter:                # do{
       xor    %ebp,%ebp          # sum = 0
       mov    $0x64,%ebx         # inner = 100
     .p2align 3  #  nopw   0x0(%rax,%rax,1)
     .Linner:                    # do {
51.82% mov    %r13,%rdi
       mov    %r15,%rsi
       mov    %r13,%rdx           # copy args from call-preserved regs
       callq  404d60              # mt PRNG for unsigned long
47.27% add    %eax,%ebp           # sum += run_once()
       add    $0xffffffff,%ebx    # --inner
       jne    .Linner            # }while(inner);
       mov    %ebp,0x4(%rsp)     # store to volatile local:  benchmark::DoNotOptimize(x);
0.91%  add    $0xffffffffffffffff,%r12   # --outer
       jne                    # } while(outer)

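Case 2 can still inline run_once into run_multiple, because class FooPlus uses int run_once() override final. There is virtual-dispatch overhead in the outer loop (only), but this small extra cost per outer-loop iteration is totally dwarfed by the cost of the inner loop, which is identical between Case 2 and Case 3.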

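So the inner loop is essentially identical, with indirect-call overhead only in the outer loop. It is unsurprising that this is unmeasurable, or at least lost in the noise on Quick-Bench.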

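Case 1 can't inline Foo::run_once() into Foo::run_multiple(), so there is function-call overhead there, too. (The fact that it's an indirect function call is relatively minor; in a tight loop, branch prediction will do a near-perfect job.)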

Case 1 and Case 2 have identical asm for their outer loop, if you look at the disassembly on your Quick-Bench link.

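Neither one can devirtualize and inline run_multiple: Case 1 because it is virtual and non-final, Case 2 because the pointer's static type is only the base class, not the derived class with a final override.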

        # case 2 and case 1 *outer* loops
      .loop:                 # do {
       mov    (%r15),%rax     # load vtable pointer
       mov    $0x64,%esi      # first C++ arg
       mov    %r15,%rdi       # this pointer = hidden first arg
       callq  *0x8(%rax)      # memory-indirect call through a vtable entry
       mov    %eax,0x4(%rsp)  # store the return value to a `volatile` local
       add    $0xffffffffffffffff,%rbx      
       jne    4049f0 .loop   #  } while(--i != 0);

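This is probably a missed optimization: the compiler can prove that Base *f came from new FooPlus(), and is thus statically known to be of type FooPlus. operator new can be overridden, but the compiler still emits a separate call to FooPlus::FooPlus() (passing it a pointer to the storage from new). So this just seems to be a case of the compiler not taking advantage of that knowledge in Case 2, and maybe also in Case 1.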
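To make the role of the pointer's static type concrete, here is a small sketch. The classes are simplified stand-ins (the method bodies are placeholders, not the benchmark code), and the three free functions are hypothetical names introduced only for this example.

```cpp
// Simplified stand-ins for the benchmarked classes (bodies are placeholders).
struct Base {
    virtual ~Base() = default;
    virtual int run_multiple(int times) { return times; }
};

struct FooPlus : Base {
    int run_multiple(int times) override final { return 2 * times; }
};

// Static type Base*: unless the compiler can prove the dynamic type,
// this call is normally dispatched through the vtable.
int call_via_base(Base* f) { return f->run_multiple(100); }

// Static type FooPlus* and a final override: the target is known at compile
// time, so the call can be direct and is a candidate for inlining.
int call_via_derived(FooPlus* f) { return f->run_multiple(100); }

// If the programmer knows the dynamic type, a cast hands that knowledge to
// the compiler. This is undefined behavior if `f` is not really a FooPlus.
int call_known_type(Base* f) {
    return static_cast<FooPlus*>(f)->run_multiple(100);
}
```

All three functions return the same value for a FooPlus object. Whether a given compiler actually emits a direct call in each case depends on the compiler and optimization level, so it is worth checking the generated asm.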

Reducing virtual method overhead using final
