Multiplying several __m128 vectors by individual elements of an __m256

Time: 2013-11-21 21:54:05

Tags: intel sse intrinsics avx

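I have eight __m128 registers, and each one needs to be multiplied by a single element of another __m256 register: a[i] is scaled by element i of b.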

One solution I came up with is:

__m128 a[8];   // input: eight 4-float vectors
__m256 b;      // input: eight scale factors, b[i] applies to a[i]

__m128 tmp = _mm256_extractf128_ps(b,0);
a[0] = _mm_mul_ps(a[0],_mm_shuffle_ps(tmp,tmp,0));
a[1] = _mm_mul_ps(a[1],_mm_shuffle_ps(tmp,tmp,0x55));
a[2] = _mm_mul_ps(a[2],_mm_shuffle_ps(tmp,tmp,0xAA));
a[3] = _mm_mul_ps(a[3],_mm_shuffle_ps(tmp,tmp,0xFF));
tmp = _mm256_extractf128_ps(b,1);
a[4] = _mm_mul_ps(a[4],_mm_shuffle_ps(tmp,tmp,0));
a[5] = _mm_mul_ps(a[5],_mm_shuffle_ps(tmp,tmp,0x55));
a[6] = _mm_mul_ps(a[6],_mm_shuffle_ps(tmp,tmp,0xAA));
a[7] = _mm_mul_ps(a[7],_mm_shuffle_ps(tmp,tmp,0xFF));
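
In case the shuffle immediates look opaque: each 2-bit field of the 8-bit immediate selects a source lane, so 0x00, 0x55, 0xAA and 0xFF copy lane 0, 1, 2 and 3 respectively into all four lanes of the result. The sketch below spells that out with the standard _MM_SHUFFLE macro. The function name broadcast_lanes and the float-array interface are made up for illustration and are not part of the code above.

```c
#include <xmmintrin.h>

/* Hypothetical helper: out[i] receives lane i of `in` copied to all four lanes.
   _MM_SHUFFLE(k,k,k,k) packs the lane index k into each 2-bit field,
   giving 0x00, 0x55, 0xAA and 0xFF for k = 0, 1, 2, 3. */
void broadcast_lanes(const float in[4], float out[4][4])
{
    __m128 v = _mm_loadu_ps(in);

    _mm_storeu_ps(out[0], _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0))); /* 0x00 */
    _mm_storeu_ps(out[1], _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1))); /* 0x55 */
    _mm_storeu_ps(out[2], _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2))); /* 0xAA */
    _mm_storeu_ps(out[3], _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 3, 3, 3))); /* 0xFF */
}
```
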

What is the best way to achieve this? Thanks.

1 answer:

Answer 0: (score: 3)

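I think your solution is about as good as it's going to get, except that I would use explicit variables rather than an array, so that everything stays in registers as far as possible: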

__m128 a0, a1, a2, a3, a4, a5, a6, a7;
__m256 b;

__m128 tmp = _mm256_extractf128_ps(b,0);
a0 = _mm_mul_ps(a0, _mm_shuffle_ps(tmp,tmp,0));
a1 = _mm_mul_ps(a1, _mm_shuffle_ps(tmp,tmp,0x55));
a2 = _mm_mul_ps(a2, _mm_shuffle_ps(tmp,tmp,0xAA));
a3 = _mm_mul_ps(a3, _mm_shuffle_ps(tmp,tmp,0xFF));
tmp = _mm256_extractf128_ps(b,1);
a4 = _mm_mul_ps(a4, _mm_shuffle_ps(tmp,tmp,0));
a5 = _mm_mul_ps(a5, _mm_shuffle_ps(tmp,tmp,0x55));
a6 = _mm_mul_ps(a6, _mm_shuffle_ps(tmp,tmp,0xAA));
a7 = _mm_mul_ps(a7, _mm_shuffle_ps(tmp,tmp,0xFF));
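
For anyone who wants to check the lane-to-vector mapping, here is a minimal, self-contained sketch of the same approach. The function name scale_rows_by_lanes, the float-array interface, and the unaligned loads and stores are assumptions added only so the snippet can be built and tested on its own; in real code the values would stay in registers as described above. It also takes the low half with _mm256_castps256_ps128, which is a pure reinterpretation and generates no instruction. A compiler may well emit the same code for _mm256_extractf128_ps(b,0), so treat that as a minor variation rather than a guaranteed speedup.

```c
#include <immintrin.h>

/* Hypothetical wrapper: multiplies row a[i] (4 floats) by the scalar b[i], i = 0..7.
   The target attribute is GCC/Clang-specific and enables AVX for this one function;
   with other compilers, drop it and enable AVX via the compiler's own option. */
__attribute__((target("avx")))
void scale_rows_by_lanes(float a[8][4], const float b[8])
{
    __m256 vb = _mm256_loadu_ps(b);

    /* Explicit variables rather than an array of __m128. */
    __m128 a0 = _mm_loadu_ps(a[0]), a1 = _mm_loadu_ps(a[1]);
    __m128 a2 = _mm_loadu_ps(a[2]), a3 = _mm_loadu_ps(a[3]);
    __m128 a4 = _mm_loadu_ps(a[4]), a5 = _mm_loadu_ps(a[5]);
    __m128 a6 = _mm_loadu_ps(a[6]), a7 = _mm_loadu_ps(a[7]);

    /* Low 128 bits (b[0..3]): a cast, no instruction generated. */
    __m128 tmp = _mm256_castps256_ps128(vb);
    a0 = _mm_mul_ps(a0, _mm_shuffle_ps(tmp, tmp, 0x00)); /* broadcast b[0] */
    a1 = _mm_mul_ps(a1, _mm_shuffle_ps(tmp, tmp, 0x55)); /* broadcast b[1] */
    a2 = _mm_mul_ps(a2, _mm_shuffle_ps(tmp, tmp, 0xAA)); /* broadcast b[2] */
    a3 = _mm_mul_ps(a3, _mm_shuffle_ps(tmp, tmp, 0xFF)); /* broadcast b[3] */

    /* High 128 bits (b[4..7]): needs a real extract. */
    tmp = _mm256_extractf128_ps(vb, 1);
    a4 = _mm_mul_ps(a4, _mm_shuffle_ps(tmp, tmp, 0x00)); /* broadcast b[4] */
    a5 = _mm_mul_ps(a5, _mm_shuffle_ps(tmp, tmp, 0x55)); /* broadcast b[5] */
    a6 = _mm_mul_ps(a6, _mm_shuffle_ps(tmp, tmp, 0xAA)); /* broadcast b[6] */
    a7 = _mm_mul_ps(a7, _mm_shuffle_ps(tmp, tmp, 0xFF)); /* broadcast b[7] */

    _mm_storeu_ps(a[0], a0); _mm_storeu_ps(a[1], a1);
    _mm_storeu_ps(a[2], a2); _mm_storeu_ps(a[3], a3);
    _mm_storeu_ps(a[4], a4); _mm_storeu_ps(a[5], a5);
    _mm_storeu_ps(a[6], a6); _mm_storeu_ps(a[7], a7);
}
```

Calling it with every a[i] set to {1, 1, 1, 1} and b set to {0, 1, ..., 7} should leave each element of a[i] equal to i, which is a quick way to confirm that no lane got mixed up.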