Question

I'm writing a program that takes two 4x4 matrices and multiplies them using intrinsics. Here is what I understand so far:

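The MMX/SSE instruction sets let you speed up computation. In particular, SSE operates on vectors of 4-byte elements.
__m128 represents a 16-byte vector (4 elements of 4 bytes each). Also, __m128 data needs to be 16-byte aligned in order to work.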
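As a rough illustration of the alignment point, here is a minimal sketch in C. The helper name double4 is made up for this example. It assumes a C11 compiler for _Alignas and an x86 target with SSE enabled. The unaligned load/store intrinsics accept any float pointer, while the aligned variants require an address that is a multiple of 16.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_loadu_ps, _mm_load_ps, _mm_store_ps, ... */

/* Hypothetical helper: reads four floats from src, doubles each lane,
   and writes the four results to dst. Neither pointer has to be aligned. */
static void double4(const float *src, float *dst)
{
    /* A local __m128 variable is placed on a 16-byte boundary by the
       compiler, so it needs no extra alignment specifier. */
    __m128 v = _mm_loadu_ps(src);   /* unaligned load: any float pointer works */
    v = _mm_add_ps(v, v);           /* lane-wise: v[k] + v[k] for k = 0..3 */

    /* The aligned variants need an address that is a multiple of 16.
       _Alignas is C11; MSVC spells it __declspec(align(16)). */
    _Alignas(16) float tmp[4];
    _mm_store_ps(tmp, v);           /* aligned store */
    v = _mm_load_ps(tmp);           /* aligned load */

    _mm_storeu_ps(dst, v);          /* unaligned store */
}
```

So alignment only becomes a concern when float arrays are passed to the aligned load/store intrinsics. Values that already live in __m128 variables are handled by the compiler.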

Here is where I get lost:

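The function _mm_mul_ps(__m128, __m128) (as I read it) takes two 16-byte vectors, each holding four 4-byte floats. It multiplies the two vectors element by element and returns an __m128. But what exactly does the returned __m128 contain (the result of what)?
The function _mm_hadd_ps(__m128, __m128) adds two 16-byte vectors (four 4-byte floats each). It adds "horizontally", like this:
vectorA(a1, a2, a3, a4) + vectorB(b1, b2, b3, b4) = vectorResult(a1 + a2, a3 + a4, b1 + b2, b3 + b4)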
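One way to picture the two intrinsics is as plain loops over four floats. The sketch below is a scalar model in C, not the intrinsics themselves. The names model_mul_ps, model_hadd_ps and model_dot4 are made up for illustration. Index 0 stands for the lowest lane, i.e. a1 in the notation above.

```c
/* Scalar model of _mm_mul_ps: lane k of the result is a[k] * b[k].
   Nothing is summed; the result is again four separate floats. */
static void model_mul_ps(const float a[4], const float b[4], float r[4])
{
    for (int k = 0; k < 4; k++)
        r[k] = a[k] * b[k];
}

/* Scalar model of _mm_hadd_ps: sums of adjacent pairs of a go into the
   two low lanes, sums of adjacent pairs of b into the two high lanes. */
static void model_hadd_ps(const float a[4], const float b[4], float r[4])
{
    /* Compute into a temporary so that r may be the same array as a or b. */
    float t[4] = { a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3] };
    for (int k = 0; k < 4; k++)
        r[k] = t[k];
}

/* Dot product built from the two models: one lane-wise multiply,
   then two horizontal adds of the vector with itself. */
static float model_dot4(const float a[4], const float b[4])
{
    float t[4];
    model_mul_ps(a, b, t);   /* (a0*b0, a1*b1, a2*b2, a3*b3) */
    model_hadd_ps(t, t, t);  /* (p0+p1, p2+p3, p0+p1, p2+p3) */
    model_hadd_ps(t, t, t);  /* every lane now holds p0+p1+p2+p3 */
    return t[0];
}
```

For a = (1, 2, 3, 4) and b = (5, 6, 7, 8), the multiply model gives (5, 12, 21, 32), the horizontal add of a and b gives (3, 7, 11, 15), and the dot product is 70.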

What I'm trying to do:

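    // Holds the products of one row of A with one column of B.
    // __m128 is already 16-byte aligned, so no align specifier is needed.
    __m128 aux;

    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            // Lane-wise products of row i of A and column j of B
            aux = _mm_mul_ps(vectorA[i], vectorB[j]);
            // Two horizontal adds leave the sum of the four products in every lane
            aux = _mm_hadd_ps(aux, aux);
            aux = _mm_hadd_ps(aux, aux);
            // NOTE: aux is overwritten on the next iteration; the sum is never stored
        }
    }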
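For reference, a complete version of this loop might look like the sketch below. It rests on several assumptions that are not in the question: both matrices are row-major float[16] arrays, the columns of B are gathered with _mm_set_ps, and each result is written out with _mm_cvtss_f32. The names mat4_mul and USE_SSE3 are hypothetical. _mm_hadd_ps needs SSE3, so the macro enables it per function on GCC/Clang; on other compilers, enable SSE3 through the build options.

```c
#include <pmmintrin.h>  /* SSE3: _mm_hadd_ps (also pulls in the SSE headers) */

/* GCC/Clang only: enable SSE3 for the function below even without -msse3. */
#if defined(__GNUC__)
#define USE_SSE3 __attribute__((target("sse3")))
#else
#define USE_SSE3
#endif

/* Hypothetical helper: C = A * B for row-major 4x4 float matrices. */
USE_SSE3
static void mat4_mul(const float A[16], const float B[16], float C[16])
{
    __m128 rowA[4], colB[4];

    for (int k = 0; k < 4; k++) {
        /* Row k of A is contiguous in memory. */
        rowA[k] = _mm_loadu_ps(&A[4 * k]);
        /* Column k of B is strided. _mm_set_ps takes its arguments from the
           highest lane to the lowest, so lane r ends up holding B[r][k]. */
        colB[k] = _mm_set_ps(B[12 + k], B[8 + k], B[4 + k], B[k]);
    }

    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            __m128 aux = _mm_mul_ps(rowA[i], colB[j]); /* four products */
            aux = _mm_hadd_ps(aux, aux);               /* pairwise sums */
            aux = _mm_hadd_ps(aux, aux);               /* full sum in every lane */
            C[4 * i + j] = _mm_cvtss_f32(aux);         /* extract lane 0 */
        }
    }
}
```

With A and B both filled with 1..16 in row-major order, the top-left entry of C is 1*1 + 2*5 + 3*9 + 4*13 = 90. This follows the structure of the loop in the question; it is not necessarily the fastest way to multiply 4x4 matrices with SSE.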

I can't see how these functions work (I have no "mental picture" of them).

How does _mm_mul_ps() multiply two __m128 values?

0 Answers: