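I am new to SSE and SSE2. I wrote a small C sample (it allocates two counters, one increasing and the other decreasing, then adds the two), and it works as expected. I used intrinsics and Microsoft Visual Studio 10 C++ Express. As a second step I wanted to understand what happens under the hood, but now I am confused. For example, the assignment in the for loop compiles to: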
__m128i a_ptr = _mm_load_si128((__m128i*)&(a_aligned[i]));
mov eax,dword ptr [i]
mov ecx,dword ptr [a_aligned]
movdqa xmm0,xmmword ptr [ecx+eax*2]
movdqa xmmword ptr [ebp-1C0h],xmm0
movdqa xmm0,xmmword ptr [ebp-1C0h]
movdqa xmmword ptr [a_ptr],xmm0
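As far as I understand, the first two lines fetch the components of the a_aligned address, and the third line copies the data into the xmm0 register. What I do not understand is why it is then copied back to memory, loaded into xmm0 again, and only then stored to a_ptr. I thought the _mm_load_si128 intrinsic would simply copy the 128 bits at a_aligned[i] into xmm0 and nothing more. Why does this happen? Is my understanding of the theory wrong? If not, how should I hint the compiler? Is my sample code correct (in the sense that it contains nothing unnecessary)? Here is my full sample code: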
#include <xmmintrin.h>
#include <emmintrin.h>
#include <iostream>
int main(int argc, char *argv[]) {
    // Three 16-byte-aligned buffers of 32 unsigned 16-bit values each.
    unsigned __int16 *a_aligned = (unsigned __int16 *)_mm_malloc(32 * sizeof(unsigned __int16),16);
    unsigned __int16 *b_aligned = (unsigned __int16 *)_mm_malloc(32 * sizeof(unsigned __int16),16);
    unsigned __int16 *c_aligned = (unsigned __int16 *)_mm_malloc(32 * sizeof(unsigned __int16),16);
    for(int i = 0; i < 32; i++) {
        a_aligned[i] = i;
        b_aligned[i] = i;
        c_aligned[i] = 0;
    }
    // Each iteration handles eight 16-bit lanes (128 bits).
    for(int i = 0; i < 32; i+=8) {
        __m128i a_ptr = _mm_load_si128((__m128i*)&(a_aligned[i]));
        __m128i b_ptr = _mm_load_si128((__m128i*)&(b_aligned[i]));
        __m128i res = _mm_add_epi16(a_ptr, b_ptr);
        _mm_store_si128((__m128i*)&(c_aligned[i]), res);
    }
    // Print every result, starting at index 0.
    for(int i = 0; i < 32; i++) {
        std::cout << c_aligned[i] << " ";
    }
    _mm_free(a_aligned);
    _mm_free(b_aligned);
    _mm_free(c_aligned);
    return 0;
}
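Answer 0 (score: 2)
Intrinsics were designed explicitly to help the compiler's code generator optimize the code better. You are looking at the assembly generated by the Debug configuration, which is not optimized: every local variable and temporary gets its own stack slot, hence the extra round trips through memory. Look at the code from the Release build instead: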
__m128i a_ptr = _mm_load_si128((__m128i*)&(a_aligned[i]));
011D10A0 movdqa xmm0,xmmword ptr [eax]
__m128i b_ptr = _mm_load_si128((__m128i*)&(b_aligned[i]));
011D10A4 movdqa xmm1,xmmword ptr [edx+eax]
__m128i res = _mm_add_epi16(a_ptr, b_ptr);
011D10A9 paddw xmm0,xmm1
_mm_store_si128((__m128i*)&(c_aligned[i]), res);
011D10AD movdqa xmmword ptr [ecx+eax],xmm0
Looks better, doesn't it?
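To inspect the optimized code for just this loop, it can help to move it into a small function of its own. The following is only a sketch under stated assumptions: the name add_u16_aligned, its parameters, and the scalar fallback for non-x86 targets are hypothetical and not part of the question. Like the original loop, it assumes 16-byte-aligned pointers and a length that is a multiple of 8.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: SSE2 intrinsics are usable on any x86/x64 target.
#if defined(_M_IX86) || defined(_M_X64) || defined(__SSE2__)
#include <emmintrin.h>
#define SKETCH_HAVE_SSE2 1
#endif

// Hypothetical helper: c[i] = a[i] + b[i] for n 16-bit values.
// Requires 16-byte-aligned pointers and n to be a multiple of 8.
void add_u16_aligned(const std::uint16_t* a, const std::uint16_t* b,
                     std::uint16_t* c, std::size_t n) {
#ifdef SKETCH_HAVE_SSE2
    for (std::size_t i = 0; i < n; i += 8) {
        // Aligned 128-bit loads, as in the question.
        __m128i va = _mm_load_si128(reinterpret_cast<const __m128i*>(a + i));
        __m128i vb = _mm_load_si128(reinterpret_cast<const __m128i*>(b + i));
        // Eight 16-bit additions at once (wraps around on overflow).
        __m128i sum = _mm_add_epi16(va, vb);
        _mm_store_si128(reinterpret_cast<__m128i*>(c + i), sum);
    }
#else
    // Scalar fallback so the sketch still builds without SSE2.
    for (std::size_t i = 0; i < n; ++i) {
        c[i] = static_cast<std::uint16_t>(a[i] + b[i]);
    }
#endif
}
```

In an optimized build the body of the SSE2 loop should reduce to roughly the load, load, paddw, store sequence shown above, although the exact registers and addressing modes depend on the compiler and its settings.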
Answer 1 (score: 1)
Enable optimizations in your compiler settings (use the Release configuration instead of Debug).
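If it is unclear which configuration a binary was built with, a rough check like the one below can help. It is only a sketch: the function name build_flavor is hypothetical, and the macros are proxies rather than a direct test for optimization. MSVC defines _DEBUG when a debug runtime library is selected, GCC and Clang define __OPTIMIZE__ when optimizing, and NDEBUG is a convention typically set by Release project templates.

```cpp
#include <string>

// Hypothetical helper: reports which build-related macro is visible.
// None of these macros proves that the optimizer actually ran.
std::string build_flavor() {
#if defined(_DEBUG)
    return "debug (_DEBUG is defined)";
#elif defined(__OPTIMIZE__)
    return "optimized (__OPTIMIZE__ is defined)";
#elif defined(NDEBUG)
    return "probably release (NDEBUG is defined)";
#else
    return "unknown";
#endif
}
```

Printing the result of build_flavor() at startup gives a quick hint about whether the disassembly being inspected comes from a Debug or a Release build.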