I'm trying to sum all the elements of an array of 8 floats using SSE intrinsics, just to learn how to use them. However, when I write it like this:
alignas(16) float Numbers[8] =
{0.0f, 1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f};
__m128 Group1 = _mm_load_ps(Numbers);
__m128 Group2 = _mm_load_ps(Numbers + 4*sizeof(float));
__m128 Zero = _mm_setzero_ps();
__m128 Sum1 = _mm_add_ps(Group1, Group2); // Sum1 = Group1 + Group2
__m128 Sum2 = _mm_hadd_ps(Sum1, Zero); // Sum2[31:0] = Sum1[31:0] + Sum1[63:32]
// Sum2[63:32] = Sum1[95:64] + Sum1[127:96]
__m128 Sum3 = _mm_hadd_ps(Sum2, Zero); // Sum3[31:0] = Sum2[31:0] + Sum2[63:32]
float Result;
_mm_store_ss(&Result, Sum3);
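Result comes out as 6 when it should be 28. I've been consulting the reference for these intrinsics, but I can't figure out what is wrong with my logic here. Any suggestions?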
Answer 0 (score: 5)
Try changing this line
__m128 Group2 = _mm_load_ps(Numbers + 4*sizeof(float));
to
__m128 Group2 = _mm_load_ps(Numbers + 4);
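(Numbers is a float[], not a char[], so pointer arithmetic on it is already scaled by sizeof(float). Numbers + 4*sizeof(float) is typically Numbers + 16, which points well past the end of the 8-element array. Reading from there is undefined behavior; a result of 6 is what you would get if that memory happened to hold zeros, leaving only 0 + 1 + 2 + 3.)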
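To make the scaling concrete, here is a minimal sketch in plain C (no SSE needed). The helper name byte_distance is hypothetical and only for illustration; it reports how many bytes a float pointer actually moves when you add n to it.

```c
#include <stddef.h>

/* Hypothetical helper, for illustration only: returns the number of
   bytes between p and p + n. Because p is a float pointer, adding n
   advances it by n elements, i.e. n * sizeof(float) bytes.
   The caller must make sure p + n stays inside (or one past) the
   array that p points into. */
size_t byte_distance(const float *p, size_t n)
{
    return (size_t)((const char *)(p + n) - (const char *)p);
}
```

With a buffer of at least 16 floats, byte_distance(buf, 4) gives 4 * sizeof(float), which is the offset of the second group of four. byte_distance(buf, 4 * sizeof(float)) gives 4 * sizeof(float) * sizeof(float), which on a platform where sizeof(float) is 4 means element 16 rather than element 4.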
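Answer 1 (score: 2)
@twin has already pointed out the main problem, but I thought I'd add a couple of points: (a) you don't need the zero vector, and (b) you don't need separate sum vectors. You can do it all in place, which should be more efficient. Here is the simplified code, which I have tested with gcc: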
#include <stdio.h>
#include <pmmintrin.h>
int main()
{
float Numbers[8] __attribute__ ((aligned(16))) =
{0.0f, 1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f};
__m128 Group1 = _mm_load_ps(Numbers);
__m128 Group2 = _mm_load_ps(Numbers + 4);
__m128 Sum = _mm_add_ps(Group1, Group2);
Sum = _mm_hadd_ps(Sum, Sum);
Sum = _mm_hadd_ps(Sum, Sum);
float Result;
_mm_store_ss(&Result, Sum);
printf("Result = %g\n", Result);
return 0;
}
Testing it:
$ gcc -Wall -msse3 sum_ps.c && ./a.out
Result = 28
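To see why passing Sum as both operands works, it can help to model the horizontal add in scalar code. The sketch below is not the intrinsic itself. hadd_model and sum8_model are hypothetical names, and the model assumes the documented lane order of _mm_hadd_ps: the low two lanes come from the first operand, the high two from the second.

```c
/* Scalar model of _mm_hadd_ps(a, b) (illustrative, not the intrinsic):
   out = { a0 + a1, a2 + a3, b0 + b1, b2 + b3 }.
   A temporary is used so that out may alias a or b. */
void hadd_model(const float a[4], const float b[4], float out[4])
{
    float r[4];
    r[0] = a[0] + a[1];
    r[1] = a[2] + a[3];
    r[2] = b[0] + b[1];
    r[3] = b[2] + b[3];
    for (int i = 0; i < 4; ++i)
        out[i] = r[i];
}

/* Mirrors the steps of the SSE code above for 8 floats. */
float sum8_model(const float numbers[8])
{
    float sum[4];
    for (int i = 0; i < 4; ++i)
        sum[i] = numbers[i] + numbers[i + 4]; /* models _mm_add_ps */
    hadd_model(sum, sum, sum);                /* pairwise sums     */
    hadd_model(sum, sum, sum);                /* total in each lane */
    return sum[0];                            /* models _mm_store_ss */
}
```

For the input 0 through 7, the vertical add gives {4, 6, 8, 10}, the first horizontal add gives {10, 18, 10, 18}, and the second gives 28 in every lane. Only lane 0 is stored, so it does not matter whether the upper lanes come from a zero vector or from Sum itself.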