Question

I want to test a function to find out which is faster: passing by value or passing by reference.

Code:

struct Vec4f
{
  float val[4];
};


Vec4f suma(const Vec4f& a, const Vec4f& b)
{
  return {a.val[0] + b.val[0], 
          a.val[1] + b.val[1],
          a.val[2] + b.val[2],
          a.val[3] + b.val[3]};
}

Vec4f sumb(Vec4f a, Vec4f b)
{
  return {a.val[0] + b.val[0], 
          a.val[1] + b.val[1],
          a.val[2] + b.val[2],
          a.val[3] + b.val[3]};
}

Assembly output from x86-64 clang with -O3 -std=c++14:

suma(Vec4f const&, Vec4f const&):                     # @suma(Vec4f const&, Vec4f const&)
        movq    xmm1, qword ptr [rdi]   # xmm1 = mem[0],zero
        movq    xmm0, qword ptr [rsi]   # xmm0 = mem[0],zero
        addps   xmm0, xmm1
        movq    xmm2, qword ptr [rdi + 8] # xmm2 = mem[0],zero
        movq    xmm1, qword ptr [rsi + 8] # xmm1 = mem[0],zero
        addps   xmm1, xmm2
        ret

sumb(Vec4f, Vec4f):                        # @sumb(Vec4f, Vec4f)
        addps   xmm0, xmm2
        addps   xmm1, xmm3
        ret

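It turns out that gcc, clang, and msvc all generate less assembly for pass by value in this particular case.

My questions are:

Is comparing the number of assembly lines generally a good heuristic for comparing the performance of simple functions like these?

And because I don't understand the assembly output very well:

Could you explain the assembly output of the suma and sumb functions?

Interestingly, if I change Vec4f to hold float val[40], both functions generate the same assembly output. So:

What is the reason for the initial difference in the assembly?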

Answer 1

1) No. Not all instructions take the same amount of time to execute, and as soon as memory has to be accessed there can be long delays.

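2) and 3). suma has to load the contents of a and b into the appropriate registers. In sumb, the values are already in registers when they are passed to the function. In some cases, the register loads that suma performs will instead be done by the caller of sumb. In other cases, the values may already be in registers, and the caller of suma first has to store them to memory so that it can create references to them.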
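To make the caller's side concrete, here is a minimal sketch. The struct and the two functions are taken from the question; `chain_ref` and `chain_val` are hypothetical callers added only for illustration. Assuming the calls are not inlined, the intermediate result in `chain_ref` has to be addressable so that a reference can point at it, while in `chain_val` it can be handed on in registers under a calling convention that passes small structs that way.

```cpp
struct Vec4f
{
  float val[4];
};

// From the question: parameters passed by const reference.
Vec4f suma(const Vec4f& a, const Vec4f& b)
{
  return {a.val[0] + b.val[0],
          a.val[1] + b.val[1],
          a.val[2] + b.val[2],
          a.val[3] + b.val[3]};
}

// From the question: parameters passed by value.
Vec4f sumb(Vec4f a, Vec4f b)
{
  return {a.val[0] + b.val[0],
          a.val[1] + b.val[1],
          a.val[2] + b.val[2],
          a.val[3] + b.val[3]};
}

// Hypothetical caller: the temporary returned by the inner call needs an
// address, so without inlining it is typically written to the stack first.
Vec4f chain_ref(const Vec4f& a, const Vec4f& b, const Vec4f& c)
{
  return suma(suma(a, b), c);
}

// Hypothetical caller: the inner result can be passed straight on by value.
Vec4f chain_val(Vec4f a, Vec4f b, Vec4f c)
{
  return sumb(sumb(a, b), c);
}
```

Both callers compute the same result; compiling them with -O3 and comparing the output is one way to see where the loads and stores end up.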

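When you use float val[40], you exceed the capacity for passing values in registers, so both functions have to load the data from memory first (in suma, by dereferencing the references; in sumb, by loading the values from the stack).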
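The size difference can be checked directly. The sketch below assumes a 4-byte float and the System V x86-64 calling convention used by gcc and clang on Linux, where a plain struct larger than 16 bytes is passed in memory. `Vec40f` and `fits_in_sse_pair` are hypothetical names, and the helper is a simplified rule of thumb, not the full ABI classification; other calling conventions have different rules.

```cpp
#include <cstddef>

struct Vec4f  { float val[4];  };  // from the question
struct Vec40f { float val[40]; };  // hypothetical name for the float val[40] variant

// Simplified rule of thumb (an assumption for illustration): a plain struct
// of floats can be passed in SSE registers only if it is at most 16 bytes,
// that is, two 8-byte halves.
constexpr bool fits_in_sse_pair(std::size_t bytes)
{
  return bytes <= 16;
}

static_assert(sizeof(float) == 4, "this sketch assumes a 4-byte float");
static_assert(fits_in_sse_pair(sizeof(Vec4f)), "16 bytes: two 8-byte halves");
static_assert(!fits_in_sse_pair(sizeof(Vec40f)), "160 bytes: passed in memory");
```

Each 8-byte half holds two floats and goes into its own xmm register, which is consistent with sumb receiving a in xmm0 and xmm1 and b in xmm2 and xmm3.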

Answer 2

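1) Maybe it can be used as a heuristic, but it cannot be trusted at all. For example, a single div instruction can be slower than 20 simple instructions. So I would not look at the instruction count at all.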
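If instruction count is not reliable, the alternative is to measure. The following is only a rough sketch of a timing loop, not a rigorous benchmark: `time_ns` is a hypothetical helper, the iteration count is arbitrary, and an optimizer may inline both functions into identical code, in which case the two timings say nothing about the calling convention. A dedicated benchmarking library would be the more dependable choice.

```cpp
#include <chrono>

struct Vec4f
{
  float val[4];
};

Vec4f suma(const Vec4f& a, const Vec4f& b)
{
  return {a.val[0] + b.val[0], a.val[1] + b.val[1],
          a.val[2] + b.val[2], a.val[3] + b.val[3]};
}

Vec4f sumb(Vec4f a, Vec4f b)
{
  return {a.val[0] + b.val[0], a.val[1] + b.val[1],
          a.val[2] + b.val[2], a.val[3] + b.val[3]};
}

// Hypothetical helper: applies op to acc `iters` times and returns the
// elapsed wall-clock time in nanoseconds. The result is written back through
// acc so the loop has an observable effect.
template <typename Op>
long long time_ns(Op op, int iters, Vec4f& acc)
{
  const Vec4f step{{1.0f, 1.0f, 1.0f, 1.0f}};
  const auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; ++i)
    acc = op(acc, step);  // each iteration depends on the previous one
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}
```

A call such as `time_ns(suma, 1000000, acc)` followed by `time_ns(sumb, 1000000, acc)` gives two numbers to compare; repeating the runs and looking at the spread matters more than any single reading.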

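2), 3)

Here is a short explanation of the assembly you listed:

clang uses only half of each vector register (an xmmX register can hold 4 float values, but clang uses only 2 of them here). This is probably because of the calling convention.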

// this function has two reference parameters
// register rdi points to the first parameter (it holds a pointer to it, not its value)
// register rsi points to the second parameter
// registers xmm0 and xmm1 contain the result
suma(Vec4f const&, Vec4f const&):
        movq    xmm1, qword ptr [rdi]   # xmm1 will contain the first 2 floats of the first parameter
        movq    xmm0, qword ptr [rsi]   # xmm0 will contain the first 2 floats of the second parameter
        addps   xmm0, xmm1              # let's add them together, xmm0 contains the result
        movq    xmm2, qword ptr [rdi + 8] # xmm2 will contain the second 2 floats of the first parameter
        movq    xmm1, qword ptr [rsi + 8] # xmm1 will contain the second 2 floats of the second parameter
        addps   xmm1, xmm2              # let's add them together, xmm1 contains the result
        ret

// this function has two parameters
// the first is passed in xmm0 and xmm1
// the second is passed in xmm2 and xmm3
// registers xmm0 and xmm1 contain the result
sumb(Vec4f, Vec4f):
        addps   xmm0, xmm2
        addps   xmm1, xmm3
        ret

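"If I change Vec4f to hold float val[40], both functions generate the same assembly output."

This is false. They don't. They look the same at first glance, but they are not.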

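Part of the code is the same in both functions: because you return a float[40] in which most members are zero, both functions must contain code that zeroes those elements. That is the code you are seeing, and it is identical. The other parts differ.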
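The zeroing comes from the return statement itself. In the sketch below, `Vec40f` and `sum40` are hypothetical names for the float val[40] variant of the question's code. Only four initializers are supplied, so the language requires the remaining 36 elements to be zero, and the compiler has to emit code that does this regardless of how the parameters are passed.

```cpp
struct Vec40f
{
  float val[40];  // hypothetical 40-element variant of Vec4f
};

Vec40f sum40(const Vec40f& a, const Vec40f& b)
{
  // Four initializers for forty elements: val[4] through val[39] are
  // zero-initialized by aggregate initialization.
  return {a.val[0] + b.val[0],
          a.val[1] + b.val[1],
          a.val[2] + b.val[2],
          a.val[3] + b.val[3]};
}
```

Even if an input has a non-zero val[5], the returned val[5] is zero, because the function never reads past index 3.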
