Question

How can I write a portable GNU C builtin vectors version of this, one that doesn't depend on the x86 set1 intrinsic?

typedef uint16_t v8su __attribute__((vector_size(16)));

v8su set1_u16_x86(uint16_t scalar) {
    return (v8su)_mm_set1_epi16(scalar);   // cast needed for gcc
}

Surely there must be a better way than this:

v8su set1_u16(uint16_t s) {
    return (v8su){s,s,s,s,  s,s,s,s};
}

I don't want to write an AVX2 version of that for broadcasting a single byte!

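Even a gcc-only or clang-only answer to this part would be interesting, for cases where you want to assign the broadcast to a variable rather than only use it as an operand of a binary operator (the latter works well with gcc, see below).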

If I want to use a broadcast scalar as one operand of a binary operator, this works with gcc (as documented in the manual), but not with clang:

v8su vecdiv10(v8su v) { return v / 10; }   // doesn't compile with clang

With clang, if I'm targeting only x86 and just using native vector syntax to get the compiler to generate modular multiplicative inverse constants and instructions for me, I can write:

v8su vecdiv_set1(v8su v) {
    return v / (v8su)_mm_set1_epi16(10);   // gcc needs the cast
}

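But then I have to change the intrinsic if I widen the vector (to _mm256_set1_epi16), instead of converting the whole code to AVX2 by changing to vector_size(32) in one place (for pure vertical SIMD that doesn't need shuffling). It also defeats part of the purpose of native vectors, since it won't compile for ARM or any other non-x86 target.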
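To make the "change the width in one place" goal concrete, here is a minimal sketch. It assumes GCC (it relies on the scalar operand in `v / 10` described above), and the names `VEC_BYTES`, `vec_u16`, `VEC_U16_LANES`, and `vecdiv10_lanes_ok` are hypothetical, introduced only for illustration.

```c
#include <stdint.h>

/* Hypothetical knob: 16 for SSE2/NEON-sized vectors, 32 for AVX2-sized ones. */
#ifndef VEC_BYTES
#define VEC_BYTES 16
#endif

typedef uint16_t vec_u16 __attribute__((vector_size(VEC_BYTES)));

/* Lane count derived from the type instead of being hard-coded. */
enum { VEC_U16_LANES = sizeof(vec_u16) / sizeof(uint16_t) };

/* Pure vertical operation: nothing in the body mentions the vector width.
   Assumes the compiler accepts a scalar operand here, as GCC does. */
vec_u16 vecdiv10(vec_u16 v) { return v / 10; }

/* Hypothetical checker: returns 1 if every lane matches scalar division. */
int vecdiv10_lanes_ok(vec_u16 v) {
    vec_u16 q = vecdiv10(v);
    for (int i = 0; i < VEC_U16_LANES; i++)
        if (q[i] != v[i] / 10) return 0;
    return 1;
}
```

Building with `-DVEC_BYTES=32` (plus a suitable `-march` option) widens the type without touching `vecdiv10`, which is exactly what an intrinsic-based broadcast would prevent.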

The ugly cast is required because gcc, unlike clang, doesn't consider v8su {aka __vector(8) short unsigned int} to be compatible with __m128i {aka __vector(2) long long int}.

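BTW, all of this compiles to good asm with gcc and clang (see it on Godbolt). This is just a question of how to write things elegantly, with readable syntax that doesn't repeat the scalar N times. For example, v / 10 is compact enough that there's no need to even put it in its own function.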

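Compiling efficiently with ICC is a bonus, but not required. GNU C native vectors are clearly an afterthought for ICC, and even simple stuff like this doesn't compile efficiently. set1_u16 compiles to 8 scalar stores and a vector load, instead of MOVD / VPBROADCASTW (with -xHOST enabled, because ICC doesn't recognize -march=haswell, but Godbolt runs on a server with AVX2 support). Purely casting the results of _mm_ intrinsics is OK, but the division calls an SVML function!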

Answer 1

A generic broadcast solution can be found for GCC and Clang using two observations:

Clang's OpenCL vector extensions and GCC's vector extensions both support scalar - vector operations.
x - 0 = x (but x + 0 does not work, due to signed zero).
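The signed-zero point can be checked in plain scalar C. This is a small sketch assuming IEEE 754 floats, the default round-to-nearest mode, and no `-ffast-math`; the two function names are hypothetical.

```c
#include <math.h>

/* x - (+0.0) preserves the sign of x, including for x == -0.0. */
int sub_zero_keeps_sign(float x) {
    return !signbit(x - 0.0f) == !signbit(x);
}

/* x + (+0.0) turns -0.0 into +0.0, so the sign bit is lost for that input. */
int add_zero_keeps_sign(float x) {
    return !signbit(x + 0.0f) == !signbit(x);
}
```

For x = -0.0f the first function reports that the sign is kept and the second reports that it is not, which is why the broadcast is written as a subtraction.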

Here is a solution for a vector of four floats.

#if defined (__clang__)
typedef float v4sf __attribute__((ext_vector_type(4)));
#else
typedef float v4sf __attribute__ ((vector_size (16)));
#endif

v4sf broadcast4f(float x) {
  return x - (v4sf){};
}

https://godbolt.org/g/PXr3Xb

The same generic solution can be used for other vector types. Here is an example for a vector of eight unsigned shorts.

#if defined (__clang__)
typedef unsigned short v8su __attribute__((ext_vector_type(8)));
#else
typedef unsigned short v8su __attribute__((vector_size(16)));
#endif

v8su broadcast8us(short x) {
  return x - (v8su){};
}
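As a usage sketch, the same subtraction trick can stand in for `_mm_set1_epi16` in the question's division example. The typedefs mirror the ones above; `set1_u16` and `vecdiv_u16` are hypothetical names, and the divisor is assumed to be nonzero.

```c
#include <stdint.h>

#if defined(__clang__)
typedef uint16_t v8su __attribute__((ext_vector_type(8)));
#else
typedef uint16_t v8su __attribute__((vector_size(16)));
#endif

/* Broadcast via "scalar - zero vector", the technique from this answer. */
static inline v8su set1_u16(uint16_t s) {
    return s - (v8su){};
}

/* Divide every lane by a run-time scalar, with no x86 intrinsic involved.
   d must be nonzero. */
v8su vecdiv_u16(v8su v, uint16_t d) {
    return v / set1_u16(d);
}
```

Because the broadcast is expressed with the vector type alone, nothing here has to change when the typedef is widened or the code is built for a non-x86 target.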

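ICC (17) supports a subset of the GCC vector extensions, but it does not support vector + scalar or vector*scalar, so intrinsics are still necessary for broadcasts. MSVC does not support any vector extensions.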

GNU C native vectors: how to broadcast a scalar, like x86's _mm_set1_epi16
