Question

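I am looking for a fast way to compute the dot product of vectors with 3 or 4 components. I have tried several things, but most examples online operate on an array of floats, while our data structure is different.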

We use 16-byte-aligned structs. Code excerpt (simplified):

struct float3 {
    float x, y, z, w; // 4th component unused here
};

struct float4 {
    float x, y, z, w;
};

In earlier tests (using the SSE4 dot-product intrinsic or FMA) I could not get a speedup over the following plain C++ code.

float dot(const float3 a, const float3 b) {
    return a.x*b.x + a.y*b.y + a.z*b.z;
}

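Tested with gcc and clang on Intel Ivy Bridge / Haswell. It seems that the time spent loading the data into SIMD registers and pulling the result back out eats up all of the benefit.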
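For illustration, a single-vector SSE4.1 attempt of the kind described above might look like the following sketch. It is not the code that was benchmarked: `Vec3` and `dot_dp` are hypothetical names, and the `target` attribute is a gcc/clang extension assumed here so the snippet builds without `-msse4.1`.

```cpp
#include <smmintrin.h>  // SSE4.1: _mm_dp_ps

// Hypothetical stand-in for the 16-byte-aligned float3 above.
struct alignas(16) Vec3 {
    float x, y, z, w;  // w is padding
};

// One dot product using the SSE4.1 dot-product instruction.
// The target attribute enables SSE4.1 for this function only
// (assumption: gcc or clang; -msse4.1 on the command line also works).
__attribute__((target("sse4.1")))
float dot_dp(const Vec3& a, const Vec3& b) {
    __m128 va = _mm_load_ps(&a.x);  // aligned load of x, y, z, w
    __m128 vb = _mm_load_ps(&b.x);
    // Mask 0x71: multiply lanes 0..2 (x, y, z) and write their sum to lane 0.
    __m128 d = _mm_dp_ps(va, vb, 0x71);
    return _mm_cvtss_f32(d);        // move lane 0 back into a scalar
}
```

Each call needs two loads, one dot-product instruction, and a transfer back to a scalar, all to produce a single float, so most of the register width goes unused.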

I would appreciate help and ideas on how to compute the dot product efficiently with our float3 / float4 data structures. SSE4, AVX, or even AVX2 is fine.

Thanks in advance.

Answer 1

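Algebraically, efficient SIMD code looks almost identical to the scalar code. So the right way to do the dot product is to operate on four float vectors at once with SSE (eight with AVX).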

Consider structuring your code like this:

#include <x86intrin.h>

struct float4 {
    __m128 xmm;
    float4 () {};
    float4 (__m128 const & x) { xmm = x; }
    float4 & operator = (__m128 const & x) { xmm = x; return *this; }
    float4 & load(float const * p) { xmm = _mm_loadu_ps(p); return *this; }
    operator __m128() const { return xmm; }
};

static inline float4 operator + (float4 const & a, float4 const & b) {
    return _mm_add_ps(a, b);
}
static inline float4 operator * (float4 const & a, float4 const & b) {
    return _mm_mul_ps(a, b);
}

struct block3 {
    float4 x, y, z;
};

struct block4 {
    float4 x, y, z, w;
};

static inline float4 dot(block3 const & a, block3 const & b) {
    return a.x*b.x + a.y*b.y + a.z*b.z;
}

static inline float4 dot(block4 const & a, block4 const & b) {
    return a.x*b.x + a.y*b.y + a.z*b.z + a.w*b.w;
}

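Note that the last two functions look almost identical to your scalar dot function, except that float becomes float4 and float4 becomes block3 or block4. This computes the dot product most efficiently.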
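If the data has to stay in the question's array-of-structs layout, one possible way to feed this scheme is to transpose four vectors into a block first. The sketch below assumes that approach; `Vec3`, `Block3`, `pack4`, and `dot4` are hypothetical names, and raw `__m128` is used instead of the wrapper class to keep it short. It uses only baseline SSE.

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_mul_ps, _MM_TRANSPOSE4_PS

// Hypothetical stand-in for the question's 16-byte-aligned float3.
struct alignas(16) Vec3 {
    float x, y, z, w;  // w is padding
};

// Same idea as block3 above: lane i of x, y, z holds vector i's components.
struct Block3 {
    __m128 x, y, z;
};

// Transpose four AoS vectors (x y z w each) into one SoA block.
inline Block3 pack4(const Vec3* v) {
    __m128 r0 = _mm_load_ps(&v[0].x);
    __m128 r1 = _mm_load_ps(&v[1].x);
    __m128 r2 = _mm_load_ps(&v[2].x);
    __m128 r3 = _mm_load_ps(&v[3].x);
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);  // rows become columns
    return Block3{r0, r1, r2};          // r3 holds the unused w values
}

// Four dot products at once: out[i] = dot(a[i], b[i]) for i in 0..3.
inline void dot4(const Vec3* a, const Vec3* b, float* out) {
    Block3 pa = pack4(a);
    Block3 pb = pack4(b);
    __m128 d = _mm_add_ps(_mm_add_ps(_mm_mul_ps(pa.x, pb.x),
                                     _mm_mul_ps(pa.y, pb.y)),
                          _mm_mul_ps(pa.z, pb.z));
    _mm_storeu_ps(out, d);
}
```

The transpose costs several shuffles per block, so this is most likely to pay off when the packed blocks are reused, or when the data can be stored in block form to begin with.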

Answer 2

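To get the most out of AVX intrinsics, you have to think along a different dimension. Instead of doing one dot product, do 8 dot products at once.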

Look up the difference between SoA and AoS. If your vectors are in SoA (structure of arrays) format, your data looks like this in memory:

// eight 3d vectors, called a.
float ax[8];
float ay[8];
float az[8];

// eight 3d vectors, called b.
float bx[8];
float by[8];
float bz[8];

Then, to multiply all 8 a vectors with all 8 b vectors, you use three SIMD multiplications, one each for x, y, and z.

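For the dot product you still need to add the products afterwards, of course, which is a little trickier. But multiplication, subtraction, and addition of vectors in SoA form are very simple and very fast. When AVX-512 becomes available, you can do 16 3D vector multiplications in just 3 instructions.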
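A minimal sketch of the eight-at-a-time dot product on the SoA arrays above. In this layout, one way to do the additions is lane by lane across the three product registers. The function name `dot8` is hypothetical, and the `target` attribute is a gcc/clang extension assumed here so the snippet builds without `-mavx`; it still needs a CPU with AVX at run time.

```cpp
#include <immintrin.h>  // AVX: __m256, _mm256_mul_ps, _mm256_add_ps

// Eight dot products from SoA inputs:
// out[i] = ax[i]*bx[i] + ay[i]*by[i] + az[i]*bz[i] for i in 0..7.
// The target attribute enables AVX for this function only
// (assumption: gcc or clang; -mavx on the command line also works).
__attribute__((target("avx")))
void dot8(const float* ax, const float* ay, const float* az,
          const float* bx, const float* by, const float* bz,
          float* out) {
    // Three SIMD multiplications, one per component.
    __m256 xx = _mm256_mul_ps(_mm256_loadu_ps(ax), _mm256_loadu_ps(bx));
    __m256 yy = _mm256_mul_ps(_mm256_loadu_ps(ay), _mm256_loadu_ps(by));
    __m256 zz = _mm256_mul_ps(_mm256_loadu_ps(az), _mm256_loadu_ps(bz));
    // Two lane-wise additions combine the products.
    __m256 d = _mm256_add_ps(_mm256_add_ps(xx, yy), zz);
    _mm256_storeu_ps(out, d);  // unaligned store of all eight results
}
```

Unaligned loads and stores are used so the sketch makes no assumption about how the arrays are allocated; with 32-byte-aligned arrays, `_mm256_load_ps` and `_mm256_store_ps` could be used instead.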

Fast dot product using SSE/AVX intrinsics
