Question

So I stumbled upon something that I would like to understand, because it is giving me a headache. I have the following code:

#include <stdio.h>
#include <smmintrin.h>

typedef union {
    struct { float x, y, z, w; } v;
    __m128 m;
} vec;

vec __attribute__((noinline)) square(vec a)
{
    vec x = { .m = _mm_mul_ps(a.m, a.m) };
    return x;
}

int main(int argc, char *argv[])
{
    float f = 4.9;
    vec a = (vec){f, f, f, f};
    vec res = square(a); // ?
    printf("%f %f %f %f\n", res.v.x, res.v.y, res.v.z, res.v.w);
    return 0;
}

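Now, the way I see it, the call to square in main should place the value of a in xmm0, so that square can do mulps xmm0, xmm0 and be done with it.

That is not what happens when I compile with clang or gcc. Instead, the first 8 bytes of a are placed in xmm0 and the next 8 bytes in xmm1, which makes square more complicated because it has to patch things back together.

Any idea why?

Note: this is with -O3 optimization.

After further research, it seems to be related to the union type. If the function takes a straight __m128, the generated code expects the value in a single register (xmm0). But given that both should fit in xmm0, I don't see why it gets split into two half-used registers when the vec type is used.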
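For comparison, here is a minimal sketch of the variant described in the last paragraph, where the function takes a bare __m128 instead of the union. The names square_m128 and square_all are made up for illustration and do not appear in the original code. Under the System V x86-64 convention the argument is expected to arrive whole in xmm0, so the body can reduce to a single mulps.

```c
#include <xmmintrin.h>

/* Hypothetical variant of square() that takes the vector type directly,
   so the whole 16-byte argument travels in one register. */
__m128 __attribute__((noinline)) square_m128(__m128 a)
{
    return _mm_mul_ps(a, a);
}

/* Illustrative helper: broadcasts f to all four lanes, squares them,
   and writes the four results to out. */
void square_all(float f, float out[4])
{
    __m128 r = square_m128(_mm_set1_ps(f));
    _mm_storeu_ps(out, r);  /* unaligned store, so out needs no special alignment */
}
```

Calling square_all(3.0f, out) fills all four elements of out with 9.0f.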

Answer 1

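The compilers are simply following the calling convention specified in the System V Application Binary Interface, AMD64 Architecture Processor Supplement, section 3.2.3 Parameter Passing.

The relevant points are: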

We first define a number of classes to classify arguments. The
classes are corresponding to AMD64 register classes and defined as:

SSE The class consists of types that fit into a vector register.

SSEUP The class consists of types that fit into a vector register and can
be passed and returned in the upper bytes of it.

The size of each argument gets rounded up to eightbytes.
The basic types are assigned their natural classes:
Arguments of types float, double, _Decimal32, _Decimal64 and __m64 are
in class SSE.

The classification of aggregate (structures and arrays) and union types
works as follows:

If the size of the aggregate exceeds a single eightbyte, each is
classified separately.

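Applying the rules above means that the x, y and z, w pairs of the embedded struct are each classified separately as SSE class, which in turn means they must be passed in two separate registers. The presence of the m member makes no difference in this case; you could even remove it.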
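The classification can be walked through with a small model. This is only a sketch: the enum, merge and classify_vec are illustrative names, the merge function paraphrases the rule the ABI gives for two fields that share an eightbyte, and the ABI's post-merger cleanup step is left out. It assumes, as the ABI states, that a __m128 is split into an SSE low half and an SSEUP high half.

```c
/* Simplified model of the System V x86-64 argument classes. */
typedef enum {
    NO_CLASS, INTEGER, SSE, SSEUP, X87, X87UP, COMPLEX_X87, MEMORY
} arg_class;

/* Paraphrase of the ABI rule for merging the classes of two fields
   that occupy the same eightbyte. */
arg_class merge(arg_class a, arg_class b)
{
    if (a == b) return a;
    if (a == NO_CLASS) return b;
    if (b == NO_CLASS) return a;
    if (a == MEMORY || b == MEMORY) return MEMORY;
    if (a == INTEGER || b == INTEGER) return INTEGER;
    if (a == X87 || a == X87UP || a == COMPLEX_X87 ||
        b == X87 || b == X87UP || b == COMPLEX_X87) return MEMORY;
    return SSE;  /* every remaining combination, including SSE with SSEUP */
}

/* Classify the two eightbytes of the union vec from the question. */
void classify_vec(arg_class out[2])
{
    /* struct { float x, y, z, w; }: two floats per eightbyte, each SSE */
    arg_class v[2] = { merge(SSE, SSE), merge(SSE, SSE) };
    /* __m128: low eightbyte is SSE, high eightbyte is SSEUP */
    arg_class m[2] = { SSE, SSEUP };

    for (int i = 0; i < 2; i++)
        out[i] = merge(v[i], m[i]);
}
```

In this model both eightbytes come out as SSE. The second one loses its SSEUP class when merged with the struct's SSE, and each SSE eightbyte is assigned its own vector register, which matches the xmm0/xmm1 split seen in the generated code.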

Answer 2

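Edit: On a second read, I'm less sure of why this is happening, but more sure that this is where it is happening. I don't think this answer is correct, but I'll leave it up because it might be helpful.

Speaking only for clang:

This issue appears to be just an unfortunate side effect of the compiler's heuristics.

Taking a brief look into clang (file CGRecordLayoutBuilder.cpp, function CGRecordLowering::lowerUnion), it appears that llvm does not represent union types internally, and that the types of a function do not change based on how they are used inside the function.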

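clang looks at your function, sees that its type signature needs 16 bytes of arguments, and then uses a heuristic to choose the type it thinks is best. It favors the { double, double } interpretation over <4 x float> (which would give the greatest efficiency in your case), because doubles are more permissive in terms of alignment.

I'm no expert on clang internals, so I could be terribly wrong, but there doesn't appear to be a particularly good way around this. If you want the optimized version, you might have to use pointer casts rather than unions to get it.

The code that I suspect is causing the issue:

void CGRecordLowering::lowerUnion() {
  ...
  // Conditionally update our storage type if we've got a new "better" one.
  if (!StorageType ||
      getAlignment(FieldType) > getAlignment(StorageType) ||
      (getAlignment(FieldType) == getAlignment(StorageType) &&
       getSize(FieldType) > getSize(StorageType)))
    StorageType = FieldType;
  ...
}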
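One possible shape for the workaround mentioned above is to keep a plain struct for field access and convert to __m128 only at the call boundary. This is a sketch under assumptions, not code from the question: float4, float4_load, float4_store and square_lanes are hypothetical names, and it uses memcpy rather than the pointer cast the answer suggests, to sidestep aliasing questions. Compilers typically turn these fixed-size copies into plain register moves.

```c
#include <string.h>
#include <xmmintrin.h>

/* Hypothetical plain struct used for named field access. */
typedef struct { float x, y, z, w; } float4;

/* Copy the four floats into a vector register value. */
static inline __m128 float4_load(const float4 *s)
{
    __m128 m;
    memcpy(&m, s, sizeof m);
    return m;
}

/* Copy a vector register value back into the struct. */
static inline float4 float4_store(__m128 m)
{
    float4 s;
    memcpy(&s, &m, sizeof s);
    return s;
}

/* The noinline function takes __m128 directly, so its argument is
   passed as a single vector rather than as two eightbytes. */
__m128 __attribute__((noinline)) square_lanes(__m128 a)
{
    return _mm_mul_ps(a, a);
}
```

A caller would write float4_store(square_lanes(float4_load(&in))) and read the result through the x, y, z, w fields.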

Why do gcc/clang use two 128-bit xmm registers to pass a single value?
